* Re: IO scheduler based IO controller V10
@ 2009-10-02 10:55 Corrado Zoccolo
  2009-10-02 11:04 ` Jens Axboe
                   ` (2 more replies)
  0 siblings, 3 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-02 10:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	riel

Hi Jens,
On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> On Fri, Oct 02 2009, Ingo Molnar wrote:
>>
>> * Jens Axboe <jens.axboe@oracle.com> wrote:
>>
>
> It's really not that simple, if we go and do easy latency bits, then
> throughput drops 30% or more. You can't say it's black and white latency
> vs throughput issue, that's just not how the real world works. The
> server folks would be most unpleased.
Could we be more selective when the latency optimization is introduced?

The code that is currently touched by Vivek's patch is:
        if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
            (cfqd->hw_tag && CIC_SEEKY(cic)))
                enable_idle = 0;
basically, when fairness=1, it becomes just:
        if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
                enable_idle = 0;

Note that, even if we enable idling here, the cfq_arm_slice_timer will use
a different idle window for seeky I/O (2ms) than for normal I/O.

I think that the 2ms idle window is good for a single rotational SATA disk scenario,
even if it supports NCQ. Realistic access times for those disks are still around 8ms
(though they scale with seek length), and waiting 2ms to see if we get a nearby
request may pay off, not only in latency and fairness, but also in throughput.

What we don't want to do is to enable idling for NCQ-enabled SSDs
(this is already taken care of in cfq_arm_slice_timer) or for hardware RAIDs.
If we agree that hardware RAIDs should be marked as non-rotational, then that
code could become:

        if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
            (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
                enable_idle = 0;
        else if (sample_valid(cic->ttime_samples)) {
		unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
		if (cic->ttime_mean > idle_time)
                        enable_idle = 0;
                else
                        enable_idle = 1;
        }

Thanks,
Corrado

>
> --
> Jens Axboe
>

-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02 10:55 IO scheduler based IO controller V10 Corrado Zoccolo
@ 2009-10-02 11:04 ` Jens Axboe
       [not found] ` <200910021255.27689.czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2009-10-02 12:49   ` Vivek Goyal
  2 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 11:04 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	riel

On Fri, Oct 02 2009, Corrado Zoccolo wrote:
> Hi Jens,
> On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> >>
> >> * Jens Axboe <jens.axboe@oracle.com> wrote:
> >>
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more. You can't say it's black and white latency
> > vs throughput issue, that's just not how the real world works. The
> > server folks would be most unpleased.
> Could we be more selective when the latency optimization is introduced?
> 
> The code that is currently touched by Vivek's patch is:
>         if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>             (cfqd->hw_tag && CIC_SEEKY(cic)))
>                 enable_idle = 0;
> basically, when fairness=1, it becomes just:
>         if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
>                 enable_idle = 0;
> 
> Note that, even if we enable idling here, the cfq_arm_slice_timer will use
> a different idle window for seeky (2ms) than for normal I/O.
> 
> I think that the 2ms idle window is good for a single rotational SATA
> disk scenario, even if it supports NCQ. Realistic access times for
> those disks are still around 8ms (though they scale with seek
> length), and waiting 2ms to see if we get a nearby request may pay
> off, not only in latency and fairness, but also in throughput.

I agree, that change looks good.

> What we don't want to do is to enable idling for NCQ-enabled SSDs
> (this is already taken care of in cfq_arm_slice_timer) or for hardware RAIDs.

Right, it was part of the bigger SSD optimization stuff I did a few
revisions back.

> If we agree that hardware RAIDs should be marked as non-rotational, then that
> code could become:
> 
>         if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>             (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
>                 enable_idle = 0;
>         else if (sample_valid(cic->ttime_samples)) {
> 		unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
> 		if (cic->ttime_mean > idle_time)
>                         enable_idle = 0;
>                 else
>                         enable_idle = 1;
>         }

Yes, agree on that too. We should probably add a different flag for
hardware RAIDs, telling the IO scheduler that this device is really
composed of several others. If it's composed only of SSDs (or has a
frontend similar to that), then non-rotational applies.

But yes, we should pass that information down.

-- 
Jens Axboe



* Re: IO scheduler based IO controller V10
  2009-10-02 10:55 IO scheduler based IO controller V10 Corrado Zoccolo
@ 2009-10-02 12:49   ` Vivek Goyal
       [not found] ` <200910021255.27689.czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  2009-10-02 12:49   ` Vivek Goyal
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-02 12:49 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Jens Axboe, Ingo Molnar, Mike Galbraith, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	riel

On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> Hi Jens,
> On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> >>
> >> * Jens Axboe <jens.axboe@oracle.com> wrote:
> >>
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more. You can't say it's black and white latency
> > vs throughput issue, that's just not how the real world works. The
> > server folks would be most unpleased.
> Could we be more selective when the latency optimization is introduced?
> 
> The code that is currently touched by Vivek's patch is:
>         if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>             (cfqd->hw_tag && CIC_SEEKY(cic)))
>                 enable_idle = 0;
> basically, when fairness=1, it becomes just:
>         if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
>                 enable_idle = 0;
> 

Actually I am not touching this code. Looking at V10, I have not
changed anything here in the idling code.

I think we are seeing latency improvements with fairness=1 because CFQ
does pure round-robin, and once a seeky reader expires, it is put at the
end of the queue.

I retained the same behavior if fairness=0, but if fairness=1 I don't
put the seeky reader at the end of the queue; instead it gets a vdisktime
based on the disk time it has used. So it should get placed ahead of the
sync readers.

I think the following code snippet in "elevator-fq.c" is making the
difference.

        /*
         * We don't want to charge more than the allocated slice, otherwise
         * this queue can miss one dispatch round, doubling max latencies.
         * On the other hand, we don't want to charge less than the
         * allocated slice, as we stick to the CFQ theme of a queue losing
         * its share if it does not use its slice, moving it to the back of
         * the service tree (almost).
         */
        if (!ioq->efqd->fairness)
                queue_charge = allocated_slice;
 
So if a sync reader consumes 100ms and a seeky reader dispatches only
one request, then in CFQ the seeky reader gets to dispatch its next request
after another 100ms.

With fairness=1, it should get a lower vdisktime when it comes back with a
new request, because its last slice usage was low (like CFS sleepers, as Mike
said). But this will make a difference only if there is more than one other
process in the system; otherwise a vtime jump will take place by the time the
seeky reader gets backlogged.

Anyway, once I started timestamping the queues and keeping a cache of
expired queues, any queue that gets a new request almost immediately should
get a lower vdisktime if it did not use its full time slice in the previous
dispatch round. Hence with fairness=1, seeky readers get more of their fair
share of the disk, because they are now placed ahead of the streaming
readers, and hence see better latencies.

In short, most likely the better latencies are being experienced because
the seeky reader is getting a lower timestamp (vdisktime), since it did not
use its full time slice in the previous dispatch round, and not because we
kept idling enabled on the seeky reader.

Thanks
Vivek

> Note that, even if we enable idling here, the cfq_arm_slice_timer will use
> a different idle window for seeky (2ms) than for normal I/O.
> 
> I think that the 2ms idle window is good for a single rotational SATA disk scenario,
> even if it supports NCQ. Realistic access times for those disks are still around 8ms
> (though they scale with seek length), and waiting 2ms to see if we get a nearby
> request may pay off, not only in latency and fairness, but also in throughput.
> 
> What we don't want to do is to enable idling for NCQ-enabled SSDs
> (this is already taken care of in cfq_arm_slice_timer) or for hardware RAIDs.
> If we agree that hardware RAIDs should be marked as non-rotational, then that
> code could become:
> 
>         if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>             (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
>                 enable_idle = 0;
>         else if (sample_valid(cic->ttime_samples)) {
> 		unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
> 		if (cic->ttime_mean > idle_time)
>                         enable_idle = 0;
>                 else
>                         enable_idle = 1;
>         }
> 
> Thanks,
> Corrado
> 
> >
> > --
> > Jens Axboe
> >
> 
> -- 
> __________________________________________________________________________
> 
> dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------


* Re: IO scheduler based IO controller V10
@ 2009-10-02 15:27       ` Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-02 15:27 UTC (permalink / raw)
  To: Vivek Goyal, Mike Galbraith
  Cc: Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers,
	dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, riel

On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
>
> Actually I am not touching this code. Looking at the V10, I have not
> changed anything here in idling code.

I based my analysis on the original patch:
http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html

Mike, can you confirm which version of the fairness patch you used
in your tests?

Corrado

> Thanks
> Vivek
>


* Re: IO scheduler based IO controller V10
  2009-10-02 15:27       ` Corrado Zoccolo
@ 2009-10-02 15:31         ` Vivek Goyal
  -1 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-02 15:31 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	riel

On Fri, Oct 02, 2009 at 05:27:55PM +0200, Corrado Zoccolo wrote:
> On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> >
> > Actually I am not touching this code. Looking at the V10, I have not
> > changed anything here in idling code.
> 
> I based my analysis on the original patch:
> http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
> 

Oh, you are talking about the fairness-for-seeky-processes patch. I thought
you were talking about the current IO controller patches. They both have a
notion of a "fairness=1" parameter but do different things, hence the
confusion.

Thanks
Vivek


> Mike, can you confirm which version of the fairness patch did you use
> in your tests?
> 
> Corrado
> 
> > Thanks
> > Vivek
> >

^ permalink raw reply	[flat|nested] 349+ messages in thread



* Re: IO scheduler based IO controller V10
  2009-10-02 15:27       ` Corrado Zoccolo
@ 2009-10-02 15:32         ` Mike Galbraith
  -1 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 15:32 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Vivek Goyal, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel,
	containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	riel

On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote:
> On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> >
> > Actually I am not touching this code. Looking at the V10, I have not
> > changed anything here in idling code.
> 
> I based my analysis on the original patch:
> http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
> 
> Mike, can you confirm which version of the fairness patch did you use
> in your tests?

That would be this one-liner.

o CFQ provides fair access to the disk in terms of disk time allocated to
  processes. Fairness is provided for applications whose think time is
  within the slice_idle (8ms default) limit.

o CFQ currently disables idling for seeky processes. So even if a process
  has think time within slice_idle limits, it will still not get a fair share
  of the disk. Disabling idling for a seeky process seems good from a
  throughput perspective but not necessarily from a fairness perspective.

o Do not disable idling based on the seek pattern of a process if the user
  has set /sys/block/<disk>/queue/iosched/fairness = 1.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -1953,7 +1953,7 @@ cfq_update_idle_window(struct cfq_data *
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
+	    (!cfqd->cfq_fairness && cfqd->hw_tag && CIC_SEEKY(cic)))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)



^ permalink raw reply	[flat|nested] 349+ messages in thread



* Re: IO scheduler based IO controller V10
  2009-10-02 15:32         ` Mike Galbraith
@ 2009-10-02 15:40           ` Vivek Goyal
  -1 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-02 15:40 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Corrado Zoccolo, Jens Axboe, Ingo Molnar, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	riel

On Fri, Oct 02, 2009 at 05:32:00PM +0200, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote:
> > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> > >
> > > Actually I am not touching this code. Looking at the V10, I have not
> > > changed anything here in idling code.
> > 
> > I based my analysis on the original patch:
> > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
> > 
> > Mike, can you confirm which version of the fairness patch did you use
> > in your tests?
> 
> That would be this one-liner.
> 

Ok, thanks. Sorry, I got confused and thought that you were using the "io
controller patches" with fairness=1.

In that case, Corrado's suggestion of refining it further and disabling idling
for seeky processes only on non-rotational media (SSDs and hardware RAID) makes
sense to me.

Thanks
Vivek
  
> o CFQ provides fair access to the disk in terms of disk time allocated to
>   processes. Fairness is provided for applications whose think time is
>   within the slice_idle (8ms default) limit.
> 
> o CFQ currently disables idling for seeky processes. So even if a process
>   has think time within slice_idle limits, it will still not get a fair share
>   of the disk. Disabling idling for a seeky process seems good from a
>   throughput perspective but not necessarily from a fairness perspective.
> 
> o Do not disable idling based on the seek pattern of a process if the user
>   has set /sys/block/<disk>/queue/iosched/fairness = 1.
> 
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/cfq-iosched.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -1953,7 +1953,7 @@ cfq_update_idle_window(struct cfq_data *
>  	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>  
>  	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> -	    (cfqd->hw_tag && CIC_SEEKY(cic)))
> +	    (!cfqd->cfq_fairness && cfqd->hw_tag && CIC_SEEKY(cic)))
>  		enable_idle = 0;
>  	else if (sample_valid(cic->ttime_samples)) {
>  		if (cic->ttime_mean > cfqd->cfq_slice_idle)
> 

^ permalink raw reply	[flat|nested] 349+ messages in thread



* Re: IO scheduler based IO controller V10
  2009-10-02 15:40           ` Vivek Goyal
  (?)
  (?)
@ 2009-10-02 16:03           ` Mike Galbraith
  -1 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 16:03 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Corrado Zoccolo, Jens Axboe, Ingo Molnar, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	riel

On Fri, 2009-10-02 at 11:40 -0400, Vivek Goyal wrote:
> On Fri, Oct 02, 2009 at 05:32:00PM +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote:
> > > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> > > >
> > > > Actually I am not touching this code. Looking at the V10, I have not
> > > > changed anything here in idling code.
> > > 
> > > I based my analysis on the original patch:
> > > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
> > > 
> > > Mike, can you confirm which version of the fairness patch did you use
> > > in your tests?
> > 
> > That would be this one-liner.
> > 
> 
> Ok. Thanks. Sorry, I got confused and thought that you are using "io
> controller patches" with fairness=1.
> 
> In that case, Corrado's suggestion of refining it further and disabling idling
> for seeky processes only on non-rotational media (SSDs and hardware RAID) makes
> sense to me.

One thing that might help with that is to have new tasks start out life
meeting the seeky criteria.  If there's anything going on, they will be.

	-Mike


^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
  2009-10-02 15:40           ` Vivek Goyal
@ 2009-10-02 16:50             ` Valdis.Kletnieks
  -1 siblings, 0 replies; 349+ messages in thread
From: Valdis.Kletnieks @ 2009-10-02 16:50 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Mike Galbraith, Corrado Zoccolo, Jens Axboe, Ingo Molnar,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, riel

[-- Attachment #1: Type: text/plain, Size: 563 bytes --]

On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:

> In that case, Corrado's suggestion of refining it further and disabling idling
> for seeky processes only on non-rotational media (SSDs and hardware RAID) makes
> sense to me.

Umm... I got petabytes of hardware RAID across the hall that very definitely
*is* rotating.  Did you mean "SSD and disk systems with big honking caches
that cover up the rotation"?  Because "RAID" and "big honking caches" are
not *quite* the same thing, and I can just see that corner case coming out
to bite somebody on the ass...


[-- Attachment #2: Type: application/pgp-signature, Size: 227 bytes --]

^ permalink raw reply	[flat|nested] 349+ messages in thread



* Re: IO scheduler based IO controller V10
  2009-10-02 16:50             ` Valdis.Kletnieks
@ 2009-10-02 19:58               ` Vivek Goyal
  -1 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-02 19:58 UTC (permalink / raw)
  To: Valdis.Kletnieks
  Cc: Mike Galbraith, Corrado Zoccolo, Jens Axboe, Ingo Molnar,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, riel

On Fri, Oct 02, 2009 at 12:50:17PM -0400, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:
> 
> > In that case, Corrado's suggestion of refining it further and disabling idling
> > for seeky processes only on non-rotational media (SSDs and hardware RAID)
> > makes sense to me.
> 
> Umm... I got petabytes of hardware RAID across the hall that very definitely
> *is* rotating.  Did you mean "SSD and disk systems with big honking caches
> that cover up the rotation"?  Because "RAID" and "big honking caches" are
> not *quite* the same thing, and I can just see that corner case coming out
> to bite somebody on the ass...
>

I guess both. On systems that have big caches and cover up the rotation, we
probably need not idle for a seeky process. And in the case of a big hardware
RAID with multiple rotating disks, instead of idling and keeping the rest of
the disks free, we are probably better off dispatching requests from the next
queue (hoping it goes to a different disk altogether).

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]               ` <20091002195815.GE4494@redhat.com>
@ 2009-10-02 22:14                 ` Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-02 22:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, dm-devel, Jens Axboe, agk, balbir, paolo.valente,
	jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel,
	fchecconi, Valdis.Kletnieks, containers, Mike Galbraith,
	linux-kernel, akpm, righi.andrea, torvalds

On Fri, Oct 2, 2009 at 9:58 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Oct 02, 2009 at 12:50:17PM -0400, Valdis.Kletnieks@vt.edu wrote:
>> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:
>>
>> Umm... I got petabytes of hardware RAID across the hall that very definitely
>> *is* rotating.  Did you mean "SSD and disk systems with big honking caches
>> that cover up the rotation"?  Because "RAID" and "big honking caches" are
>> not *quite* the same thing, and I can just see that corner case coming out
>> to bite somebody on the ass...
>>
>
> I guess both. On systems which have big caches that cover up for rotation,
> we probably need not idle for seeky processes. And in the case of big hardware
> RAID, with multiple rotating disks, instead of idling and keeping the rest
> of the disks free, we are probably better off dispatching requests from the
> next queue (hoping it is going to a different disk altogether).

In fact I think that the 'rotating' flag name is misleading.
All the checks we are doing actually check whether the device truly
supports multiple parallel operations, a feature shared by
hardware RAIDs and NCQ-enabled SSDs, but not by cheap SSDs or single
NCQ-enabled SATA disks.

If we really wanted a "seek is cheap" flag, we could measure seek time
in the I/O scheduler itself, but in the current code base nothing uses
the flag with this meaning.
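The "measure seek time in the I/O scheduler" idea could look roughly like this (a userspace Python sketch, not kernel code; the 4 ms cutoff is an invented threshold, not a value from this thread):

```python
# Sketch: classify a device as "seek is cheap" from sampled service
# times of random (seeky) requests. The threshold is an assumption.
from statistics import mean

SEEK_CHEAP_THRESHOLD_MS = 4.0  # invented cutoff for illustration

def seek_is_cheap(sampled_service_times_ms):
    """True if random-I/O service times look SSD- or cache-like rather
    than rotational (roughly 8 ms per random request on a SATA disk)."""
    return mean(sampled_service_times_ms) < SEEK_CHEAP_THRESHOLD_MS

print(seek_is_cheap([0.2, 0.3, 0.25]))  # SSD-like samples -> True
print(seek_is_cheap([7.5, 9.0, 8.2]))   # rotational-like samples -> False
```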

Thanks,
Corrado

>
> Thanks
> Vivek
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                 ` <4e5e476b0910021514i1b461229t667bed94fd67f140@mail.gmail.com>
@ 2009-10-02 22:27                   ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-02 22:27 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: dhaval, dm-devel, Jens Axboe, agk, balbir, paolo.valente,
	jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel,
	fchecconi, Valdis.Kletnieks, containers, Mike Galbraith,
	linux-kernel, akpm, righi.andrea, torvalds

On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> On Fri, Oct 2, 2009 at 9:58 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Oct 02, 2009 at 12:50:17PM -0400, Valdis.Kletnieks@vt.edu wrote:
> >> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:
> >>
> >> Umm... I got petabytes of hardware RAID across the hall that very definitely
> >> *is* rotating.  Did you mean "SSD and disk systems with big honking caches
> >> that cover up the rotation"?  Because "RAID" and "big honking caches" are
> >> not *quite* the same thing, and I can just see that corner case coming out
> >> to bite somebody on the ass...
> >>
> >
> > I guess both. On systems which have big caches that cover up for rotation,
> > we probably need not idle for seeky processes. And in the case of big hardware
> > RAID, with multiple rotating disks, instead of idling and keeping the rest
> > of the disks free, we are probably better off dispatching requests from the
> > next queue (hoping it is going to a different disk altogether).
> 
> In fact I think that the 'rotating' flag name is misleading.
> All the checks we are doing actually check whether the device truly
> supports multiple parallel operations, a feature shared by
> hardware RAIDs and NCQ-enabled SSDs, but not by cheap SSDs or single
> NCQ-enabled SATA disks.
> 

While we are at it, what happens to the notion of priority of tasks on SSDs?
Without idling there is no continuous time slice and there is no
fairness. So is ioprio out of the window for SSDs?

On SSDs, would it make more sense to provide fairness in terms of number
of IOs or size of IO, and not in terms of time slices?

Thanks
Vivek

> If we really wanted a "seek is cheap" flag, we could measure seek time
> in the I/O scheduler itself, but in the current code base nothing uses
> the flag with this meaning.
> 
> Thanks,
> Corrado
> 
> >
> > Thanks
> > Vivek
> >
> 
> 
> 
> -- 
> __________________________________________________________________________
> 
> dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------
> The self-confidence of a warrior is not the self-confidence of the average
> man. The average man seeks certainty in the eyes of the onlooker and calls
> that self-confidence. The warrior seeks impeccability in his own eyes and
> calls that humbleness.
>                                Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                   ` <20091002222756.GG4494@redhat.com>
@ 2009-10-03 12:43                     ` Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-03 12:43 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval, dm-devel, Jens Axboe, agk, balbir, paolo.valente,
	jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel,
	fchecconi, Valdis.Kletnieks, containers, Mike Galbraith,
	linux-kernel, akpm, righi.andrea, torvalds

On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
>> In fact I think that the 'rotating' flag name is misleading.
>> All the checks we are doing actually check whether the device truly
>> supports multiple parallel operations, a feature shared by
>> hardware RAIDs and NCQ-enabled SSDs, but not by cheap SSDs or single
>> NCQ-enabled SATA disks.
>>
>
> While we are at it, what happens to the notion of priority of tasks on SSDs?
This is not changed by the proposed patch w.r.t. current CFQ.
> Without idling there is no continuous time slice and there is no
> fairness. So is ioprio out of the window for SSDs?
I don't have NCQ-enabled SSDs here, so I can't test it, but it seems to
me that the way in which queues are sorted in the rr tree may still
provide some sort of fairness and service differentiation for
priorities, in terms of number of IOs.
Non-NCQ SSDs, instead, will still have the idle window enabled, so it
is not an issue for them.
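The kind of per-priority service differentiation in terms of IO counts speculated about above might be sketched like this (a hypothetical round-robin with per-priority request budgets; the weights are invented for illustration and are not CFQ's actual rr-tree logic):

```python
# Hypothetical request-count fairness: each priority level gets a
# dispatch budget per round. Weights are invented for illustration.
from collections import deque

WEIGHTS = {0: 8, 4: 4, 7: 1}  # assumed per-priority budgets

def dispatch_round(queues):
    """One round-robin pass: pop up to WEIGHTS[prio] requests from
    each priority queue, highest priority (lowest number) first."""
    out = []
    for prio in sorted(queues):
        for _ in range(WEIGHTS[prio]):
            if queues[prio]:
                out.append((prio, queues[prio].popleft()))
    return out

queues = {p: deque(range(20)) for p in (0, 4, 7)}
counts = {}
for prio, _req in dispatch_round(queues):
    counts[prio] = counts.get(prio, 0) + 1
print(counts)  # {0: 8, 4: 4, 7: 1}
```

Here the differentiation is proportional to request counts rather than time slices, which is why it could make sense even when the device never idles.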
>
> On SSDs, would it make more sense to provide fairness in terms of number
> of IOs or size of IO, and not in terms of time slices?
Not on all SSDs. There are still ones that have a non-negligible
penalty on non-sequential access patterns (hopefully the ones without
NCQ; but if we find otherwise, then we will have to benchmark access
time in the I/O scheduler to select the best policy). For those, time-based
fairness may still be needed.

Thanks,
Corrado

>
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
       [not found]                     ` <4e5e476b0910030543o776fb505ka0ce38da9d83b33c@mail.gmail.com>
@ 2009-10-03 13:38                       ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-03 13:38 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: dhaval, dm-devel, Jens Axboe, agk, balbir, paolo.valente,
	jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel,
	fchecconi, Valdis.Kletnieks, containers, Mike Galbraith,
	linux-kernel, akpm, righi.andrea, torvalds

On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> >> In fact I think that the 'rotating' flag name is misleading.
> >> All the checks we are doing actually check whether the device truly
> >> supports multiple parallel operations, a feature shared by
> >> hardware RAIDs and NCQ-enabled SSDs, but not by cheap SSDs or single
> >> NCQ-enabled SATA disks.
> >>
> >
> > While we are at it, what happens to the notion of priority of tasks on SSDs?
> This is not changed by the proposed patch w.r.t. current CFQ.

This is a general question irrespective of the current patch. I want to
know what our statement is w.r.t. ioprio and what it means for the user:
when do we support it and when do we not?

> > Without idling there is no continuous time slice and there is no
> > fairness. So is ioprio out of the window for SSDs?
> I don't have NCQ-enabled SSDs here, so I can't test it, but it seems to
> me that the way in which queues are sorted in the rr tree may still
> provide some sort of fairness and service differentiation for
> priorities, in terms of number of IOs.

I have an NCQ-enabled SSD. Sometimes I see the difference, sometimes I do
not. I guess this happens because sometimes idling is enabled and sometimes
not, because of the dynamic nature of hw_tag.

I ran three fio read jobs for 10 seconds. The first job is prio 0, the
second prio 4 and the third prio 7.

(prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
(prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
(prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec

Note that there is almost no difference between the prio 0 and prio 4 jobs,
while the prio 7 job has been penalized heavily (it gets less than 10% of
the prio 4 job's bandwidth).
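A hedged reconstruction of what those three jobs may have looked like as a fio job file (the original command lines are not in the thread; the device path, block size and access pattern are assumptions inferred from the reported bandwidth and IOPS):

```ini
; Hypothetical fio job file: three concurrent 10 s readers in the
; best-effort class at ioprio 0, 4 and 7. filename/bs/rw are assumptions.
[global]
rw=randread
bs=4k
direct=1
runtime=10
time_based
filename=/dev/sdb

[prio0]
prioclass=2
prio=0

[prio4]
prioclass=2
prio=4

[prio7]
prioclass=2
prio=7
```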

> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
> is not an issue for them.

Agree.

> >
> > On SSDs, would it make more sense to provide fairness in terms of number
> > of IOs or size of IO, and not in terms of time slices?
> Not on all SSDs. There are still ones that have a non-negligible
> penalty on non-sequential access patterns (hopefully the ones without
> NCQ; but if we find otherwise, then we will have to benchmark access
> time in the I/O scheduler to select the best policy). For those, time-based
> fairness may still be needed.

Ok.

So on the better SSDs out there with NCQ, we probably don't support the
notion of ioprio? Or am I missing something?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
  2009-10-03 12:43                   ` Corrado Zoccolo
@ 2009-10-03 13:38                       ` Vivek Goyal
       [not found]                     ` <4e5e476b0910030543o776fb505ka0ce38da9d83b33c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-03 13:38 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, riel

On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> >> In fact I think that the 'rotating' flag name is misleading.
> >> All the checks we are doing are actually checking if the device truly
> >> supports multiple parallel operations, and this feature is shared by
> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
> >> NCQ-enabled SATA disk.
> >>
> >
> > While we are at it, what happens to notion of priority of tasks on SSDs?
> This is not changed by proposed patch w.r.t. current CFQ.

This is a general question irrespective of current patch. Want to know
what is our statement w.r.t ioprio and what it means for user? When do
we support it and when do we not.

> > Without idling there is not continuous time slice and there is no
> > fairness. So ioprio is out of the window for SSDs?
> I haven't NCQ enabled SSDs here, so I can't test it, but it seems to
> me that the way in which queues are sorted in the rr tree may still
> provide some sort of fairness and service differentiation for
> priorities, in terms of number of IOs.

I have a NCQ enabled SSD. Sometimes I see the difference sometimes I do
not. I guess this happens because sometimes idling is enabled and sometmes
not because of dyanamic nature of hw_tag.

I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
third prio7.

(prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
(prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
(prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec

Note there is almost no difference between prio 0 and prio 4 job and prio7
job has been penalized heavily (gets less than 10% BW of prio 4 job).

> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
> is not an issue for them.

Agree.

> >
> > On SSDs, will it make more sense to provide fairness in terms of number or
> > IO or size of IO and not in terms of time slices.
> Not on all SSDs. There are still ones that have a non-negligible
> penalty on non-sequential access pattern (hopefully the ones without
> NCQ, but if we find otherwise, then we will have to benchmark access
> time in I/O scheduler to select the best policy). For those, time
> based may still be needed.

Ok.

So on better SSDs out there with NCQ, we probably don't support the notion of
ioprio? Or, I am missing something.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
@ 2009-10-03 13:38                       ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-03 13:38 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: dhaval, peterz, dm-devel, dpshah, Jens Axboe, agk, balbir,
	paolo.valente, jmarchan, fernando, Ulrich Lukas, mikew, jmoyer,
	nauman, Ingo Molnar, m-ikeda, riel, lizf, fchecconi,
	Valdis.Kletnieks, containers, Mike Galbraith, linux-kernel, akpm,
	righi.andrea, torvalds

On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> >> In fact I think that the 'rotating' flag name is misleading.
> >> All the checks we are doing are actually checking if the device truly
> >> supports multiple parallel operations, and this feature is shared by
> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
> >> NCQ-enabled SATA disk.
> >>
> >
> > While we are at it, what happens to notion of priority of tasks on SSDs?
> This is not changed by proposed patch w.r.t. current CFQ.

This is a general question irrespective of the current patch. I want to know
what our statement is w.r.t. ioprio and what it means for the user: when do
we support it and when do we not?

> > Without idling there is no continuous time slice and there is no
> > fairness. So ioprio is out of the window for SSDs?
> I don't have NCQ enabled SSDs here, so I can't test it, but it seems to
> me that the way in which queues are sorted in the rr tree may still
> provide some sort of fairness and service differentiation for
> priorities, in terms of number of IOs.

I have an NCQ enabled SSD. Sometimes I see the difference, sometimes I do
not. I guess this happens because sometimes idling is enabled and sometimes
not, because of the dynamic nature of hw_tag.

I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
third prio7.

(prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
(prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
(prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec

Note there is almost no difference between the prio 0 and prio 4 jobs, and
the prio 7 job has been penalized heavily (it gets less than 10% of the BW
of the prio 4 job).

> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
> is not an issue for them.

Agree.

> >
> > On SSDs, will it make more sense to provide fairness in terms of number of
> > IOs or size of IO and not in terms of time slices.
> Not on all SSDs. There are still ones that have a non-negligible
> penalty on non-sequential access pattern (hopefully the ones without
> NCQ, but if we find otherwise, then we will have to benchmark access
> time in I/O scheduler to select the best policy). For those, time
> based may still be needed.

Ok.

So on the better SSDs out there with NCQ, we probably don't support the notion of
ioprio? Or am I missing something?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler  based IO controller V10)
  2009-10-03 13:38                       ` Vivek Goyal
  (?)
@ 2009-10-04  9:15                       ` Corrado Zoccolo
  2009-10-04 12:11                         ` Vivek Goyal
  -1 siblings, 1 reply; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-04  9:15 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, riel

Hi Vivek,
On Sat, Oct 3, 2009 at 3:38 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
>> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
>> >> In fact I think that the 'rotating' flag name is misleading.
>> >> All the checks we are doing are actually checking if the device truly
>> >> supports multiple parallel operations, and this feature is shared by
>> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
>> >> NCQ-enabled SATA disk.
>> >>
>> >
>> > While we are at it, what happens to notion of priority of tasks on SSDs?
>> This is not changed by proposed patch w.r.t. current CFQ.
>
> This is a general question irrespective of the current patch. I want to know
> what our statement is w.r.t. ioprio and what it means for the user: when do
> we support it and when do we not?
>
>> > Without idling there is no continuous time slice and there is no
>> > fairness. So ioprio is out of the window for SSDs?
>> I don't have NCQ enabled SSDs here, so I can't test it, but it seems to
>> me that the way in which queues are sorted in the rr tree may still
>> provide some sort of fairness and service differentiation for
>> priorities, in terms of number of IOs.
>
> I have an NCQ enabled SSD. Sometimes I see the difference, sometimes I do
> not. I guess this happens because sometimes idling is enabled and sometimes
> not, because of the dynamic nature of hw_tag.
>
My guess is that the formula that is used to handle this case is not
very stable.
The culprit code is (in cfq_service_tree_add):
        } else if (!add_front) {
                rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
                rb_key += cfqq->slice_resid;
                cfqq->slice_resid = 0;
        } else

cfq_slice_offset is defined as:

static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
                                      struct cfq_queue *cfqq)
{
        /*
         * just an approximation, should be ok.
         */
	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
                       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
}

Can you try changing the latter to a simpler one (we already observed that
busy_queues is unstable, and I think that it is not needed here at
all):
	return -cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
and remove the 'rb_key += cfqq->slice_resid; ' from the former.

This should give tasks with larger slices a higher probability of being
first in the tree, so it will work if we don't idle, but it needs
some adjustment if we idle.

> I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
> third prio7.
>
> (prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
> (prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
> (prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec
>
> Note there is almost no difference between the prio 0 and prio 4 jobs, and
> the prio 7 job has been penalized heavily (it gets less than 10% of the BW
> of the prio 4 job).
>
>> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
>> is not an issue for them.
>
> Agree.
>
>> >
>> > On SSDs, will it make more sense to provide fairness in terms of number of
>> > IOs or size of IO and not in terms of time slices.
>> Not on all SSDs. There are still ones that have a non-negligible
>> penalty on non-sequential access pattern (hopefully the ones without
>> NCQ, but if we find otherwise, then we will have to benchmark access
>> time in I/O scheduler to select the best policy). For those, time
>> based may still be needed.
>
> Ok.
>
> So on the better SSDs out there with NCQ, we probably don't support the notion of
> ioprio? Or am I missing something?

I think we try, but the current formula is simply not good enough.

Thanks,
Corrado

>
> Thanks
> Vivek
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
  2009-10-04  9:15                       ` Corrado Zoccolo
@ 2009-10-04 12:11                         ` Vivek Goyal
  2009-10-04 12:46                           ` Corrado Zoccolo
  0 siblings, 1 reply; 349+ messages in thread
From: Vivek Goyal @ 2009-10-04 12:11 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, riel

On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
> Hi Vivek,
> On Sat, Oct 3, 2009 at 3:38 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
> >> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> >> >> In fact I think that the 'rotating' flag name is misleading.
> >> >> All the checks we are doing are actually checking if the device truly
> >> >> supports multiple parallel operations, and this feature is shared by
> >> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
> >> >> NCQ-enabled SATA disk.
> >> >>
> >> >
> >> > While we are at it, what happens to notion of priority of tasks on SSDs?
> >> This is not changed by proposed patch w.r.t. current CFQ.
> >
> > This is a general question irrespective of the current patch. I want to know
> > what our statement is w.r.t. ioprio and what it means for the user: when do
> > we support it and when do we not?
> >
> >> > Without idling there is no continuous time slice and there is no
> >> > fairness. So ioprio is out of the window for SSDs?
> >> I don't have NCQ enabled SSDs here, so I can't test it, but it seems to
> >> me that the way in which queues are sorted in the rr tree may still
> >> provide some sort of fairness and service differentiation for
> >> priorities, in terms of number of IOs.
> >
> > I have an NCQ enabled SSD. Sometimes I see the difference, sometimes I do
> > not. I guess this happens because sometimes idling is enabled and sometimes
> > not, because of the dynamic nature of hw_tag.
> >
> My guess is that the formula that is used to handle this case is not
> very stable.

In general I agree that formula to calculate the slice offset is very 
puzzling as busy_queues varies and that changes the position of the task
sometimes.

> The culprit code is (in cfq_service_tree_add):
>         } else if (!add_front) {
>                 rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
>                 rb_key += cfqq->slice_resid;
>                 cfqq->slice_resid = 0;
>         } else
> 
> cfq_slice_offset is defined as:
> 
> static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
>                                       struct cfq_queue *cfqq)
> {
>         /*
>          * just an approximation, should be ok.
>          */
> 	return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
>                        cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
> }
> 
> Can you try changing the latter to a simpler one (we already observed that
> busy_queues is unstable, and I think that it is not needed here at
> all):
> 	return -cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
> and remove the 'rb_key += cfqq->slice_resid; ' from the former.
> 
> This should give tasks with larger slices a higher probability of being
> first in the tree, so it will work if we don't idle, but it needs
> some adjustment if we idle.

I am not sure what's the intent here by removing busy_queues stuff. I have
got two questions though.

- Why don't we keep it a simple round robin where a task is simply placed at
  the end of the service tree?

- Secondly, CFQ provides the full slice length only to queues which are
  idling (in the case of a sequential reader). If we do not enable idling, as
  in the case of NCQ enabled SSDs, then CFQ will expire the queue almost
  immediately and put the queue at the end of the service tree (almost).

So if we don't enable idling, at max we can provide fairness: we
essentially just let every queue dispatch one request and put it at the
end of the service tree. Hence no fairness....

Thanks
Vivek

> 
> > I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
> > third prio7.
> >
> > (prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
> > (prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
> > (prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec
> >
> > Note there is almost no difference between the prio 0 and prio 4 jobs, and
> > the prio 7 job has been penalized heavily (it gets less than 10% of the BW
> > of the prio 4 job).
> >
> >> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
> >> is not an issue for them.
> >
> > Agree.
> >
> >> >
> >> > On SSDs, will it make more sense to provide fairness in terms of number of
> >> > IOs or size of IO and not in terms of time slices.
> >> Not on all SSDs. There are still ones that have a non-negligible
> >> penalty on non-sequential access pattern (hopefully the ones without
> >> NCQ, but if we find otherwise, then we will have to benchmark access
> >> time in I/O scheduler to select the best policy). For those, time
> >> based may still be needed.
> >
> > Ok.
> >
> > So on the better SSDs out there with NCQ, we probably don't support the notion of
> > ioprio? Or am I missing something?
> 
> I think we try, but the current formula is simply not good enough.
> 
> Thanks,
> Corrado
> 
> >
> > Thanks
> > Vivek
> >
> 
> 
> 
> -- 
> __________________________________________________________________________
> 
> dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------
> The self-confidence of a warrior is not the self-confidence of the average
> man. The average man seeks certainty in the eyes of the onlooker and calls
> that self-confidence. The warrior seeks impeccability in his own eyes and
> calls that humbleness.
>                                Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler  based IO controller V10)
  2009-10-04 12:11                         ` Vivek Goyal
@ 2009-10-04 12:46                           ` Corrado Zoccolo
  2009-10-04 16:20                             ` Fabio Checconi
                                               ` (2 more replies)
  0 siblings, 3 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-04 12:46 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, riel

Hi Vivek,
On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
>> Hi Vivek,
>> My guess is that the formula that is used to handle this case is not
>> very stable.
>
> In general I agree that formula to calculate the slice offset is very
> puzzling as busy_queues varies and that changes the position of the task
> sometimes.
>
> I am not sure what's the intent here by removing busy_queues stuff. I have
> got two questions though.

In the ideal steady-state case, busy_queues will be a constant. Since
we are just comparing the values among themselves, we can just
remove this constant completely.

Whenever it is not constant, it seems to me that it can cause wrong
behaviour, i.e. when the number of processes with ready I/O reduces, a
later coming request can jump before older requests.
So it seems it does more harm than good, hence I suggest to remove it.

Moreover, I suggest removing also the slice_resid part, since its
semantics doesn't seem consistent.
When computed, it is not the residency, but the remaining time slice.
Then it is used to postpone, instead of anticipate, the position of
the queue in the RR, that seems counterintuitive (it would be
intuitive, though, if it was actually a residency, not a remaining
slice, i.e. you already got your full share, so you can wait longer to
be serviced again).

>
> - Why don't we keep it a simple round robin where a task is simply placed at
>  the end of the service tree?

This should work for the idling case, since we provide service
differentiation by means of time slice.
For non-idling case, though, the appropriate placement of queues in
the tree (as given by my formula) can still provide it.

>
> - Secondly, CFQ provides the full slice length only to queues which are
>  idling (in the case of a sequential reader). If we do not enable idling, as
>  in the case of NCQ enabled SSDs, then CFQ will expire the queue almost
>  immediately and put the queue at the end of the service tree (almost).
>
> So if we don't enable idling, at max we can provide fairness: we
> essentially just let every queue dispatch one request and put it at the
> end of the service tree. Hence no fairness....

We should distinguish the two terms fairness and service
differentiation. Fairness is when every queue gets the same amount of
service share. This is not what we want when priorities are different
(we want the service differentiation, instead), but is what we get if
we do just round robin without idling.

To fix this, we can alter the placement in the tree, so that if we
have Q1 with slice S1, and Q2 with slice S2, always ready to perform
I/O, we get that Q1 is in front of the tree with probability
S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2).
This is what my formula should achieve.

Thanks,
Corrado

>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
  2009-10-04 12:46                           ` Corrado Zoccolo
@ 2009-10-04 16:20                             ` Fabio Checconi
       [not found]                               ` <20091004162005.GH4650-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
  2009-10-05 21:21                                 ` Corrado Zoccolo
       [not found]                             ` <4e5e476b0910040546h5f77cd1fo3172fe5c229eb579-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-10-06 21:36                               ` Vivek Goyal
  2 siblings, 2 replies; 349+ messages in thread
From: Fabio Checconi @ 2009-10-04 16:20 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Vivek Goyal, Valdis.Kletnieks, Mike Galbraith, Jens Axboe,
	Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, paolo.valente, ryov, fernando,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, riel

> From: Corrado Zoccolo <czoccolo@gmail.com>
> Date: Sun, Oct 04, 2009 02:46:44PM +0200
>
> Hi Vivek,
> On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> My guess is that the formula that is used to handle this case is not
> >> very stable.
> >
> > In general I agree that formula to calculate the slice offset is very
> > puzzling as busy_queues varies and that changes the position of the task
> > sometimes.
> >
> > I am not sure what's the intent here by removing busy_queues stuff. I have
> > got two questions though.
> 
> In the ideal steady-state case, busy_queues will be a constant. Since
> we are just comparing the values among themselves, we can just
> remove this constant completely.
> 
> Whenever it is not constant, it seems to me that it can cause wrong
> behaviour, i.e. when the number of processes with ready I/O reduces, a
> later coming request can jump before older requests.
> So it seems it does more harm than good, hence I suggest to remove it.
> 
> Moreover, I suggest removing also the slice_resid part, since its
> semantics doesn't seem consistent.
> When computed, it is not the residency, but the remaining time slice.
> Then it is used to postpone, instead of anticipate, the position of
> the queue in the RR, that seems counterintuitive (it would be
> intuitive, though, if it was actually a residency, not a remaining
> slice, i.e. you already got your full share, so you can wait longer to
> be serviced again).
> 
> >
> > - Why don't we keep it a simple round robin where a task is simply placed at
> >  the end of the service tree?
> 
> This should work for the idling case, since we provide service
> differentiation by means of time slice.
> For non-idling case, though, the appropriate placement of queues in
> the tree (as given by my formula) can still provide it.
> 
> >
> > - Secondly, CFQ provides the full slice length only to queues which are
> >  idling (in the case of a sequential reader). If we do not enable idling, as
> >  in the case of NCQ enabled SSDs, then CFQ will expire the queue almost
> >  immediately and put the queue at the end of the service tree (almost).
> >
> > So if we don't enable idling, at max we can provide fairness: we
> > essentially just let every queue dispatch one request and put it at the
> > end of the service tree. Hence no fairness....
> 
> We should distinguish the two terms fairness and service
> differentiation. Fairness is when every queue gets the same amount of
> service share. This is not what we want when priorities are different
> (we want the service differentiation, instead), but is what we get if
> we do just round robin without idling.
> 
> To fix this, we can alter the placement in the tree, so that if we
> have Q1 with slice S1, and Q2 with slice S2, always ready to perform
> I/O, we get that Q1 is in front of the tree with probability
> S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2).
> This is what my formula should achieve.
> 

But if the ``always ready to perform I/O'' assumption held then even RR
would have provided service differentiation, always seeing backlogged
queues and serving them according to their weights.

In this case the problem is what Vivek described some time ago as the
interlocked service of sync queues, where the scheduler is trying to
differentiate between the queues, but they are not always asking for
service (as they are synchronous and they are backlogged only for short
time intervals).

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
  2009-10-04 12:46                           ` Corrado Zoccolo
@ 2009-10-05 15:06                                 ` Jeff Moyer
       [not found]                             ` <4e5e476b0910040546h5f77cd1fo3172fe5c229eb579-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-10-06 21:36                               ` Vivek Goyal
  2 siblings, 0 replies; 349+ messages in thread
From: Jeff Moyer @ 2009-10-05 15:06 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Corrado Zoccolo <czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> Moreover, I suggest removing also the slice_resid part, since its
> semantics doesn't seem consistent.
> When computed, it is not the residency, but the remaining time slice.

It stands for residual, not residency.  Make more sense?

Cheers,
Jeff

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
       [not found]                                 ` <x49my457uef.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>
@ 2009-10-05 21:09                                   ` Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-05 21:09 UTC (permalink / raw)
  To: Jeff Moyer
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
> Corrado Zoccolo <czoccolo@gmail.com> writes:
>
>> Moreover, I suggest removing also the slice_resid part, since its
>> semantics doesn't seem consistent.
>> When computed, it is not the residency, but the remaining time slice.
>
> It stands for residual, not residency.  Make more sense?
It makes sense when computed, but not when used in rb_key computation.
Why should we postpone queues that were preempted, instead of giving
them a boost?

Thanks,
Corrado

>
> Cheers,
> Jeff
>
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
       [not found]                               ` <20091004162005.GH4650-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>
@ 2009-10-05 21:21                                 ` Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-05 21:21 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, Valdis.Kletnieks-PjAqaU27lzQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Sun, Oct 4, 2009 at 6:20 PM, Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> But if the ``always ready to perform I/O'' assumption held then even RR
> would have provided service differentiation, always seeing backlogged
> queues and serving them according to their weights.

Right, this property is too strong. But even a weaker one, "the two queues
have think times less than the disk access time", will be enough to
achieve the same goal by means of proper placement in the RR tree.

If both think times are greater than access time, then each queue will
get a service level equivalent to it being the only queue in the
system, so in this case service differentiation will not apply (do we
need to differentiate when everyone gets exactly what he needs?).

If one think time is less, and the other is more than the access time,
then we should decide what kind of fairness we want to have,
especially if the one with larger think time has also higher priority.

> In this case the problem is what Vivek described some time ago as the
> interlocked service of sync queues, where the scheduler is trying to
> differentiate between the queues, but they are not always asking for
> service (as they are synchronous and they are backlogged only for short
> time intervals).

Corrado

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
@ 2009-10-05 21:21                                 ` Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-05 21:21 UTC (permalink / raw)
  To: Fabio Checconi
  Cc: dhaval, peterz, dm-devel, dpshah, Jens Axboe, agk, balbir,
	paolo.valente, jmarchan, fernando, Ulrich Lukas, mikew, jmoyer,
	nauman, Ingo Molnar, Vivek Goyal, m-ikeda, riel, lizf,
	Valdis.Kletnieks, containers, Mike Galbraith, linux-kernel, akpm,
	righi.andrea, torvalds

On Sun, Oct 4, 2009 at 6:20 PM, Fabio Checconi <fchecconi@gmail.com> wrote:
> But if the ``always ready to perform I/O'' assumption held then even RR
> would have provided service differentiation, always seeing backlogged
> queues and serving them according to their weights.

Right, this property is too strong. But also a weaker "the two queues
have think times less than the disk access time" will be enough to
achieve the same goal by means of proper placement in the RR tree.

If both think times are greater than access time, then each queue will
get a service level equivalent to it being the only queue in the
system, so in this case service differentiation will not apply (do we
need to differentiate when everyone gets exactly what he needs?).

If one think time is less, and the other is more than the access time,
then we should decide what kind of fairness we want to have,
especially if the one with larger think time has also higher priority.

> In this case the problem is what Vivek described some time ago as the
> interlocked service of sync queues, where the scheduler is trying to
> differentiate between the queues, but they are not always asking for
> service (as they are synchronous and they are backlogged only for short
> time intervals).

Corrado

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
       [not found]                                   ` <4e5e476b0910051409x33f8365flf32e8e7548d72e79-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-06  8:41                                     ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-06  8:41 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, Jeff Moyer, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Valdis.Kletnieks-PjAqaU27lzQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Oct 05 2009, Corrado Zoccolo wrote:
> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > Corrado Zoccolo <czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
> >
> >> Moreover, I suggest removing also the slice_resid part, since its
> >> semantics doesn't seem consistent.
> >> When computed, it is not the residency, but the remaining time slice.
> >
> > It stands for residual, not residency.  Make more sense?
> It makes sense when computed, but not when used in rb_key computation.
> Why should we postpone queues that where preempted, instead of giving
> them a boost?

We should not. If it is/was working correctly, it should allow for both
increase and decrease of tree position (hence it's a long and can go
negative) to account for both over and under time.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
       [not found]                                     ` <20091006084120.GJ5216-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-06  9:00                                       ` Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-06  9:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, Jeff Moyer, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Valdis.Kletnieks-PjAqaU27lzQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, Oct 6, 2009 at 10:41 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> On Mon, Oct 05 2009, Corrado Zoccolo wrote:
>> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer@redhat.com> wrote:
>> > It stands for residual, not residency.  Make more sense?
>> It makes sense when computed, but not when used in rb_key computation.
>> Why should we postpone queues that where preempted, instead of giving
>> them a boost?
>
> We should not, if it is/was working correctly, it should allow both for
> increase/descrease of tree position (hence it's a long and can go
> negative) to account for both over and under time.

I'm doing some tests with and without it.
Here is how it works now:
definition:
        if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
                cfqq->slice_resid = cfqq->slice_end - jiffies;
                cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid);
        }
* here resid is > 0 if there was residual time, and < 0 if the queue
overran its slice.
use:
                rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
                rb_key += cfqq->slice_resid;
                cfqq->slice_resid = 0;
* here if the residual is > 0, we postpone the queue, i.e. penalize it.
If the residual is < 0 (i.e. the queue overran), we move it earlier,
i.e. we boost it.

So this is likely not what we want.
I did some tests with and without it, or changing the sign, and it
doesn't matter at all for pure sync workloads.

The only case in which it matters a little, from my experiments, is
for sync vs async workload. Here, since async queues are preempted,
the current form of the code penalizes them, so they get larger
delays, and we get more bandwidth for sync.
This is, btw, the only positive outcome (I can think of) from the
current form of the code, and I think we could obtain it more easily
by unconditionally adding a delay for async queues:
                rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
                if (!cfq_cfqq_sync(cfqq))
                        rb_key += CFQ_ASYNC_DELAY;

completely removing the resid stuff (or at least keeping it, but using
it with the proper sign).

Corrado
>
> --
> Jens Axboe
>
>
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
       [not found]                                       ` <4e5e476b0910060200i7c028b3fr4c235bf5f18c3aa1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-06 18:53                                         ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-06 18:53 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, Jeff Moyer, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Valdis.Kletnieks-PjAqaU27lzQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, Oct 06 2009, Corrado Zoccolo wrote:
> On Tue, Oct 6, 2009 at 10:41 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > On Mon, Oct 05 2009, Corrado Zoccolo wrote:
> >> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> > It stands for residual, not residency.  Make more sense?
> >> It makes sense when computed, but not when used in rb_key computation.
> >> Why should we postpone queues that where preempted, instead of giving
> >> them a boost?
> >
> > We should not, if it is/was working correctly, it should allow both for
> > increase/descrease of tree position (hence it's a long and can go
> > negative) to account for both over and under time.
> 
> I'm doing some tests with and without it.
> How it is working now is:
> definition:
>         if (timed_out && !cfq_cfqq_slice_new(cfqq)) {
>                 cfqq->slice_resid = cfqq->slice_end - jiffies;
>                 cfq_log_cfqq(cfqd, cfqq, "resid=%ld",
> cfqq->slice_resid);
>         }
> * here resid is > 0 if there was residual time, and < 0 if the queue
> overrun its slice.
> use:
>                 rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
>                 rb_key += cfqq->slice_resid;
>                 cfqq->slice_resid = 0;
> * here if residual is > 0, we postpone, i.e. penalize.  If residual is
> < 0 (i.e. the queue overrun), we anticipate it, i.e. we boost it.
> 
> So this is likely not what we want.

Indeed, that should be -= cfqq->slice_resid.

> I did some tests with and without it, or changing the sign, and it
> doesn't matter at all for pure sync workloads.

For most cases it will not change things a lot, but it should be
technically correct.

> The only case in which it matters a little, from my experiments, is
> for sync vs async workload. Here, since async queues are preempted,
> the current form of the code penalizes them, so they get larger
> delays, and we get more bandwidth for sync.

Right

> This is, btw, the only positive outcome (I can think of) from the
> current form of the code, and I think we could obtain it more easily
> by unconditionally adding a delay for async queues:
>                 rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
> 		if (!cfq_cfqq_sync(cfqq)) {
>                         rb_key += CFQ_ASYNC_DELAY;
> 	        }
> 
> removing completely the resid stuff (or at least leaving us with the
> ability of using it with the proper sign).

It's more likely for the async queue to overrun, but it can happen for
others as well. I'm keeping the residual count, but making the sign
change of course.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
       [not found]                             ` <4e5e476b0910040546h5f77cd1fo3172fe5c229eb579-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-10-05 15:06                                 ` Jeff Moyer
@ 2009-10-06 21:36                               ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-06 21:36 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Valdis.Kletnieks-PjAqaU27lzQ,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Sun, Oct 04, 2009 at 02:46:44PM +0200, Corrado Zoccolo wrote:
> Hi Vivek,
> On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> My guess is that the formula that is used to handle this case is not
> >> very stable.
> >
> > In general I agree that formula to calculate the slice offset is very
> > puzzling as busy_queues varies and that changes the position of the task
> > sometimes.
> >
> > I am not sure what's the intent here by removing busy_queues stuff. I have
> > got two questions though.
> 
> In the ideal case steady state, busy_queues will be a constant. Since
> we are just comparing the values between themselves, we can just
> remove this constant completely.
> 
> Whenever it is not constant, it seems to me that it can cause wrong
> behaviour, i.e. when the number of processes with ready I/O reduces, a
> later coming request can jump before older requests.
> So it seems it does more harm than good, hence I suggest to remove it.
> 

I agree here. busy_queues can vary, especially given the fact that CFQ
removes the queue from service tree immediately after the dispatch, if the
queue is empty, and then it waits for request completion from the queue
and idles on the queue.

So consider the following scenario, where two thinking readers and one
writer are executing. The readers preempt the writer and the writer gets
back into the tree. When the writer gets backlogged, at that point
busy_queues=2, and when a reader gets backlogged, busy_queues=1 (most of
the time, because a reader is idling), and hence readers often get placed
ahead of the writer.

This is so subtle that I am not sure it was designed that way.

So the dependence on busy_queues can change queue ordering in
unpredictable ways.


> Moreover, I suggest removing also the slice_resid part, since its
> semantics doesn't seem consistent.
> When computed, it is not the residency, but the remaining time slice.
> Then it is used to postpone, instead of anticipate, the position of
> the queue in the RR, that seems counterintuitive (it would be
> intuitive, though, if it was actually a residency, not a remaining
> slice, i.e. you already got your full share, so you can wait longer to
> be serviced again).
> 
> >
> > - Why don't we keep it simple round robin where a task is simply placed at
> >  the end of service tree.
> 
> This should work for the idling case, since we provide service
> differentiation by means of time slice.
> For non-idling case, though, the appropriate placement of queues in
> the tree (as given by my formula) can still provide it.
> 

So for the non-idling case, instead of providing service differentiation
by giving a bigger slice to the queue, you provide it by the number of
times the queue is scheduled to run?

This will work only to an extent and depends on the size of the IO being
dispatched from each queue. If some queues issue bigger requests and some
smaller ones (easily driven by changing the block size), then again you
will not see fair numbers. In that case it might make sense to provide
fairness in terms of IO size or number of IOs.

So to me it boils down to the seek cost of the underlying media. If seek
cost is high, provide fairness in terms of time slices; if seek cost is
really low, one can afford faster switching of queues without losing too
much throughput, and in that case fairness in terms of IO size should be
good.

Now, if seek cost is low on good SSDs with NCQ, I am wondering if it will
make sense to tweak CFQ to change mode dynamically and start providing
fairness in terms of IO size/number of IOs?

> >
> > - Secondly, CFQ provides full slice length only to queues which are
> >  idling (in case of a sequential reader). If we do not enable idling, as
> >  in case of NCQ enabled SSDs, then CFQ will expire the queue almost
> >  immediately and put the queue at the end of the service tree (almost).
> >
> > So if we don't enable idling, at max we can provide fairness; we
> > essentially just let every queue dispatch one request and put it at the
> > end of the service tree. Hence no fairness....
> 
> We should distinguish the two terms fairness and service
> differentiation. Fairness is when every queue gets the same amount of
> service share.

Will it not be "proportionate amount of service share" instead of "same
amount of service share"?

> This is not what we want when priorities are different
> (we want the service differentiation, instead), but is what we get if
> we do just round robin without idling.
> 
> To fix this, we can alter the placement in the tree, so that if we
> have Q1 with slice S1, and Q2 with slice S2, always ready to perform
> I/O, we get that Q1 is in front of the tree with probability
> S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2).
> This is what my formula should achieve.

I have yet to get into the details but, as I said, this sounds like
fairness by frequency, i.e. by the number of times a queue is scheduled to
dispatch. So it will help to some extent on NCQ enabled SSDs but will
become unfair if the IO sizes the queues dispatch are very different.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
  2009-10-04 12:46                           ` Corrado Zoccolo
@ 2009-10-06 21:36                               ` Vivek Goyal
       [not found]                             ` <4e5e476b0910040546h5f77cd1fo3172fe5c229eb579-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-10-06 21:36                               ` Vivek Goyal
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-06 21:36 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, riel

On Sun, Oct 04, 2009 at 02:46:44PM +0200, Corrado Zoccolo wrote:
> Hi Vivek,
> On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> My guess is that the formula that is used to handle this case is not
> >> very stable.
> >
> > In general I agree that formula to calculate the slice offset is very
> > puzzling as busy_queues varies and that changes the position of the task
> > sometimes.
> >
> > I am not sure what's the intent here by removing busy_queues stuff. I have
> > got two questions though.
> 
> In the ideal case steady state, busy_queues will be a constant. Since
> we are just comparing the values between themselves, we can just
> remove this constant completely.
> 
> Whenever it is not constant, it seems to me that it can cause wrong
> behaviour, i.e. when the number of processes with ready I/O reduces, a
> later coming request can jump before older requests.
> So it seems it does more harm than good, hence I suggest to remove it.
> 

I agree here. busy_queues can vary, especially given the fact that CFQ
removes the queue from the service tree immediately after the dispatch if
the queue is empty, and then waits for request completion from the queue
and idles on it.

So consider the following scenario where two thinking readers and one writer
are executing. The readers preempt the writer and the writer gets back into
the tree. When the writer gets backlogged, busy_queues=2 at that point in
time, and when a reader gets backlogged, busy_queues=1 (most of the time,
because the other reader is idling); hence the readers often get placed
ahead of the writer.

This is so subtle that I am not sure it was designed that way.

So dependence on busy_queues can change queue ordering in unpredictable
ways.


> Moreover, I suggest removing also the slice_resid part, since its
> semantics doesn't seem consistent.
> When computed, it is not the residency, but the remaining time slice.
> Then it is used to postpone, instead of anticipate, the position of
> the queue in the RR, that seems counterintuitive (it would be
> intuitive, though, if it was actually a residency, not a remaining
> slice, i.e. you already got your full share, so you can wait longer to
> be serviced again).
> 
> >
> > - Why don't we keep it a simple round robin where a task is simply placed
> >  at the end of the service tree?
> 
> This should work for the idling case, since we provide service
> differentiation by means of time slice.
> For non-idling case, though, the appropriate placement of queues in
> the tree (as given by my formula) can still provide it.
> 

So for the non-idling case, instead of providing service differentiation
by giving a bigger slice to the queue, you provide it by the number of
times the queue is scheduled to run?

This will work only to an extent and depends on the size of the IO being
dispatched from each queue. If some queues issue bigger requests and some
smaller ones (easily driven by changing the block size), then again you
will not see fair numbers. In that case it might make sense to provide
fairness in terms of IO size or number of IOs.

So to me it boils down to the seek cost of the underlying media. If seek
cost is high, provide fairness in terms of time slices; if seek cost is
really low, one can afford faster switching of queues without losing too
much throughput, and in that case fairness in terms of IO size should be
good.

Now, if seek cost is low on good SSDs with NCQ, I am wondering if it will
make sense to tweak CFQ to change mode dynamically and start providing
fairness in terms of IO size/number of IOs?

> >
> > - Secondly, CFQ provides full slice length only to queues which are
> >  idling (in case of a sequential reader). If we do not enable idling, as
> >  in case of NCQ enabled SSDs, then CFQ will expire the queue almost
> >  immediately and put the queue at the end of the service tree (almost).
> >
> > So if we don't enable idling, at max we can provide fairness; we
> > essentially just let every queue dispatch one request and put it at the
> > end of the service tree. Hence no fairness....
> 
> We should distinguish the two terms fairness and service
> differentiation. Fairness is when every queue gets the same amount of
> service share.

Will it not be "proportionate amount of service share" instead of "same
amount of service share"?

> This is not what we want when priorities are different
> (we want the service differentiation, instead), but is what we get if
> we do just round robin without idling.
> 
> To fix this, we can alter the placement in the tree, so that if we
> have Q1 with slice S1, and Q2 with slice S2, always ready to perform
> I/O, we get that Q1 is in front of the tree with probability
> S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2).
> This is what my formula should achieve.

I have yet to get into the details but, as I said, this sounds like
fairness by frequency, i.e. by the number of times a queue is scheduled to
dispatch. So it will help to some extent on NCQ enabled SSDs but will
become unfair if the IO sizes the queues dispatch are very different.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                       ` <4ACCC4B7.4050805-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-10-08 10:22                                         ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-08 10:22 UTC (permalink / raw)
  To: riel-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Rik,

Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Ryo Tsuruta wrote:
> 
> > If once dm-ioband is integrated into the LVM tools and bandwidth can
> > be assigned per device by lvcreate, the use of dm-tools is no longer
> > required for users.
> 
> A lot of large data center users have a SAN, with volume management
> handled SAN-side and dedicated LUNs for different applications or
> groups of applications.
> 
> Because of alignment issues, they typically use filesystems directly
> on top of the LUNs, without partitions or LVM layers.  We cannot rely
> on LVM for these systems, because people prefer not to use that.

Thank you for your explanation. So I plan to reimplement dm-ioband in the
block layer to make the dm-tools no longer required. The opinion I wrote
above assumes that dm-ioband is used for a logical volume which consists of
multiple physical devices. If dm-ioband is integrated into the LVM tools,
then the use of the dm-tools is not required and the underlying physical
devices can be automatically detected and configured to use dm-ioband.

Thanks,
Ryo Tsuruta

> Besides ... isn't the goal of the cgroups io bandwidth controller
> to control the IO used by PROCESSES?
>
> If we want to control processes, why would we want the configuration
> to be applied to any other kind of object in the system?

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                       ` <20091007150929.GB3674-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-10-08  2:18                                         ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-08  2:18 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,

Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Ok. Our numbers can vary a bit depending on fio settings like block size
> and underlying storage. But that's not the important thing. With this test
> I just wanted to point out that the model of ioprio within a group is
> currently broken with dm-ioband, and it is good that you can reproduce that.
> 
> One minor nit: for max latency you need to look at the "clat" row and the
> "max=" field in fio output. Most of the time "max latency" will matter
> most. You seem to be currently grepping for "maxt", which just seems to
> tell how long the test ran, in this case 30 seconds.
> 
> Assigning reads to the right context in CFQ, and not to the dm-ioband
> thread, might help a bit, but I am a bit skeptical, and the following is
> the reason.
> 
> CFQ relies on time, providing a longer time slice to a higher priority
> process, and if one does not use the time slice, it loses its share. So the
> moment you buffer even a single bio of a process in the dm layer, if CFQ
> was servicing that process at the same time, that process will lose its
> share. CFQ will at max anticipate for 8 ms, and if buffering is longer than
> 8 ms, CFQ will expire the queue and move on to the next queue. Later, if
> you submit the same bio from the dm-ioband helper thread, even if CFQ
> attributes it to the right process, it is not going to help much as the
> process already lost its slice and now a new slice will start.

O.K. I would like to figure out something for this issue.

> > > > Be that as it may, I think that if every bio can point to the iocontext
> > > > of the process, then it makes it possible to handle IO priority in the
> > > > higher level controller. A patchset has already been posted by
> > > > Takahashi-san. What do you think about this idea?
> > > > 
> > > >   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > > >   Subject [RFC][PATCH 1/10] I/O context inheritance
> > > >   From Hirokazu Takahashi <>
> > > >   http://lkml.org/lkml/2008/4/22/195
> > > 
> > > So far you have been denying that there are issues with ioprio within a
> > > group in the higher level controller. Here you seem to be saying that
> > > there are issues with ioprio and we need to take this patch in to solve
> > > the issue? I am confused.
> > 
> > The true intention of this patch is to preserve the io-context of the
> > process which originated it, but I think that we could also make use of
> > this patch as one of the ways to solve this issue.
> > 
> 
> Ok. Did you run the same test with this patch applied, and how do the
> numbers look? Can you please forward-port it to 2.6.31? I would also like
> to play with it.

I'm sorry, I have no time to do that this week. I would like to do the
forward porting and test with it by the mini-summit, when possible.

> I am running more tests/numbers with 2.6.31 for all the IO controllers and
> planning to post them to lkml before we meet for the IO mini summit.
> Numbers can help us understand the issue better.
> 
> In the first phase I am planning to post numbers for the IO scheduler
> controller and dm-ioband. Then I will get to the max bw controller of
> Andrea Righi.

That sounds good. Thank you for your work.

> > I created those patches against 2.6.32-rc1 and made sure the patches
> > can be cleanly applied to that version.
> 
> I am applying the dm-ioband patch first and then the bio cgroup patches.
> Is this the right order? Will try again.

Yes, the order is right. Here are the sha1sums.
9f4e50878d77922c84a29be9913a8b5c3f66e6ec linux-2.6.32-rc1.tar.bz2
15d7cc9d801805327204296a2454d6c5346dd2ae dm-ioband-1.14.0.patch
5e0626c14a40c319fb79f2f78378d2de5cc97b02 blkio-cgroup-v13.tar.bz2

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-07 15:09                                       ` Vivek Goyal
@ 2009-10-08  2:18                                         ` Ryo Tsuruta
  -1 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-08  2:18 UTC (permalink / raw)
  To: vgoyal
  Cc: nauman, m-ikeda, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, agk, akpm, peterz, jmarchan, torvalds, mingo, riel,
	yoshikawa.takuya

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> Ok. Our numbers can vary a bit depending on fio settings like block size
> and underlying storage also. But that's not the important thing. Currently
> with this test I just wanted to point out that model of ioprio with-in group
> is currently broken with dm-ioband and good that you can reproduce that.
> 
> One minor nit, for max latency you need to look at "clat " row and "max=" field
> in fio output. Most of the time "max latency" will matter most. You seem to
> be currently grepping for "maxt" which is just seems to be telling how
> long did test run and in this case 30 seconds.
> 
> Assigning reads to right context in CFQ and not to dm-ioband thread might
> help a bit, but I am bit skeptical and following is the reason.
> 
> CFQ relies on time providing longer time slice length for higher priority
> process and if one does not use time slice, it looses its share. So the moment
> you buffer even single bio of a process in dm-layer, if CFQ was servicing that
> process at same time, that process will loose its share. CFQ will at max
> anticipate for 8 ms and if buffering is longer than 8ms, CFQ will expire the
> queue and move on to next queue. Later if you submit same bio and with
> dm-ioband helper thread and even if CFQ attributes it to right process, it is
> not going to help much as process already lost it slice and now a new slice
> will start.

O.K. I would like to figure something out this issue.

> > > > Be that as it way, I think that if every bio can point the iocontext
> > > > of the process, then it makes it possible to handle IO priority in the
> > > > higher level controller. A patchse has already posted by Takhashi-san.
> > > > What do you think about this idea?
> > > > 
> > > >   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > > >   Subject [RFC][PATCH 1/10] I/O context inheritance
> > > >   From Hirokazu Takahashi <>
> > > >   http://lkml.org/lkml/2008/4/22/195
> > > 
> > > So far you have been denying that there are issues with ioprio with-in
> > > group in higher level controller. Here you seems to be saying that there are
> > > issues with ioprio and we need to take this patch in to solve the issue? I am
> > > confused?
> > 
> > The true intention of this patch is to preserve the io-context of a
> > process which originate it, but I think that we could also make use of
> > this patch for one of the way to solve this issue.
> > 
> 
> Ok. Did you run the same test with this patch applied and how do numbers look
> like? Can you please forward port it to 2.6.31 and I will also like to
> play with it?

I'm sorry, I have no time to do that this week. I would like to do the
forward porting and test with it by the mini-summit when poissible.

> I am running more tests/numbers with 2.6.31 for all the IO controllers and
> planning to post it to lkml before we meet for IO mini summit. Numbers can
> help us understand the issue better.
> 
> In first phase I am planning to post numbers for IO scheudler controller
> and dm-ioband. Then will get to max bw controller of Andrea Righi.

That sounds good. Thank you for your work.

> > I created those patches against 2.6.32-rc1 and made sure the patches
> > can be cleanly applied to that version.
> 
> I am applying dm-ioband patch first and then bio cgroup patches. Is this
> right order? Will try again.

Yes, the order is right. Here are the sha1sums.
9f4e50878d77922c84a29be9913a8b5c3f66e6ec linux-2.6.32-rc1.tar.bz2
15d7cc9d801805327204296a2454d6c5346dd2ae dm-ioband-1.14.0.patch
5e0626c14a40c319fb79f2f78378d2de5cc97b02 blkio-cgroup-v13.tar.bz2

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-10-08  2:18                                         ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-08  2:18 UTC (permalink / raw)
  To: vgoyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew,
	yoshikawa.takuya, jmoyer, nauman, mingo, righi.andrea, riel,
	lizf, fchecconi, s-uchida, containers, linux-kernel, akpm,
	m-ikeda, torvalds

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> Ok. Our numbers can vary a bit depending on fio settings like block size
> and underlying storage also. But that's not the important thing. Currently
> with this test I just wanted to point out that model of ioprio with-in group
> is currently broken with dm-ioband and good that you can reproduce that.
> 
> One minor nit, for max latency you need to look at "clat " row and "max=" field
> in fio output. Most of the time "max latency" will matter most. You seem to
> be currently grepping for "maxt" which is just seems to be telling how
> long did test run and in this case 30 seconds.
> 
> Assigning reads to right context in CFQ and not to dm-ioband thread might
> help a bit, but I am bit skeptical and following is the reason.
> 
> CFQ relies on time providing longer time slice length for higher priority
> process and if one does not use time slice, it looses its share. So the moment
> you buffer even single bio of a process in dm-layer, if CFQ was servicing that
> process at same time, that process will loose its share. CFQ will at max
> anticipate for 8 ms and if buffering is longer than 8ms, CFQ will expire the
> queue and move on to next queue. Later if you submit same bio and with
> dm-ioband helper thread and even if CFQ attributes it to right process, it is
> not going to help much as process already lost it slice and now a new slice
> will start.

O.K. I would like to figure something out this issue.

> > > > Be that as it way, I think that if every bio can point the iocontext
> > > > of the process, then it makes it possible to handle IO priority in the
> > > > higher level controller. A patchse has already posted by Takhashi-san.
> > > > What do you think about this idea?
> > > > 
> > > >   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > > >   Subject [RFC][PATCH 1/10] I/O context inheritance
> > > >   From Hirokazu Takahashi <>
> > > >   http://lkml.org/lkml/2008/4/22/195
> > > 
> > > So far you have been denying that there are issues with ioprio within a
> > > group in the higher level controller. Here you seem to be saying that
> > > there are issues with ioprio and we need to take this patch in to solve
> > > them? I am confused.
> > 
> > The true intention of this patch is to preserve the io-context of the
> > process which originates the IO, but I think that we could also make use of
> > this patch as one way to solve this issue.
> > 
> 
> Ok. Did you run the same test with this patch applied, and what do the
> numbers look like? Can you please forward port it to 2.6.31? I would also
> like to play with it.

I'm sorry, I have no time to do that this week. I would like to do the
forward porting and test it before the mini-summit if possible.

> I am running more tests/numbers with 2.6.31 for all the IO controllers and
> planning to post it to lkml before we meet for IO mini summit. Numbers can
> help us understand the issue better.
> 
> In the first phase I am planning to post numbers for the IO scheduler
> controller and dm-ioband. Then I will get to the max bw controller of
> Andrea Righi.

That sounds good. Thank you for your work.

> > I created those patches against 2.6.32-rc1 and made sure the patches
> > can be cleanly applied to that version.
> 
> I am applying the dm-ioband patch first and then the bio cgroup patches. Is
> this the right order? I will try again.

Yes, the order is right. Here are the sha1sums.
9f4e50878d77922c84a29be9913a8b5c3f66e6ec linux-2.6.32-rc1.tar.bz2
15d7cc9d801805327204296a2454d6c5346dd2ae dm-ioband-1.14.0.patch
5e0626c14a40c319fb79f2f78378d2de5cc97b02 blkio-cgroup-v13.tar.bz2
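For anyone cross-checking the downloads, `sha1sum -c` can verify files against
a pasted list of sums. A minimal sketch, using a stand-in file rather than the
actual tarballs listed above:

```shell
# Stand-in file; in practice the .sha1 list would hold the published sums
# for linux-2.6.32-rc1.tar.bz2, dm-ioband-1.14.0.patch, and so on.
printf 'demo contents\n' > /tmp/demo-download
sha1sum /tmp/demo-download > /tmp/demo.sha1
sha1sum -c /tmp/demo.sha1    # prints "/tmp/demo-download: OK" on a match
```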

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                     ` <20091007.233805.183040347.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2009-10-07 15:09                                       ` Vivek Goyal
@ 2009-10-07 16:41                                       ` Rik van Riel
  1 sibling, 0 replies; 349+ messages in thread
From: Rik van Riel @ 2009-10-07 16:41 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Ryo Tsuruta wrote:

> If once dm-ioband is integrated into the LVM tools and bandwidth can
> be assigned per device by lvcreate, the use of dm-tools is no longer
> required for users.

A lot of large data center users have a SAN, with volume management
handled SAN-side and dedicated LUNs for different applications or
groups of applications.

Because of alignment issues, they typically use filesystems directly
on top of the LUNs, without partitions or LVM layers.  We cannot rely
on LVM for these systems, because people prefer not to use that.

Besides ... isn't the goal of the cgroups io bandwidth controller
to control the IO used by PROCESSES?

If we want to control processes, why would we want the configuration
to be applied to any other kind of object in the system?

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                     ` <20091007.233805.183040347.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-10-07 15:09                                       ` Vivek Goyal
  2009-10-07 16:41                                       ` Rik van Riel
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-07 15:09 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Oct 07, 2009 at 11:38:05PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > >> If one would like to
> > > > >> combine some physical disks into one logical device like a dm-linear,
> > > > >> I think one should map the IO controller on each physical device and
> > > > >> combine them into one logical device.
> > > > >>
> > > > >
> > > > > In fact this sounds like a more complicated step where one has to setup
> > > > > one dm-ioband device on top of each physical device. But I am assuming
> > > > > that this will go away once you move to a per-request-queue implementation.
> > > 
> > > I don't understand why the per request queue implementation makes it
> > > go away. If dm-ioband is integrated into the LVM tools, it could allow
> > > users to skip the complicated steps to configure dm-linear devices.
> > > 
> > 
> > Those who are not using dm-tools will be forced to use dm-tools for
> > bandwidth control features.
> 
> If once dm-ioband is integrated into the LVM tools and bandwidth can
> be assigned per device by lvcreate, the use of dm-tools is no longer
> required for users.

But it is the same thing. Now the LVM tools are mandatory to use?

> 
> > Interesting. In all the test cases you always test with sequential
> > readers. I have changed the test case a bit (I have already reported the
> > results in another mail, now running the same test again with dm-version
> > 1.14). I made all the readers doing direct IO and in other group I put
> > a buffered writer. So setup looks as follows.
> > 
> > In group1, I launch 1 prio 0 reader and increasing number of prio4
> > readers. In group 2 I just run a dd doing buffered writes. Weights of
> > both the groups are 100 each.
> > 
> > Following are the results on 2.6.31 kernel.
> > 
> > With-dm-ioband
> > ==============
> > <------------prio4 readers---------------------->  <---prio0 reader------>
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   9992KiB/s   9992KiB/s   9992KiB/s   413K usec   4621KiB/s   369K usec   
> > 2   4859KiB/s   4265KiB/s   9122KiB/s   344K usec   4915KiB/s   401K usec   
> > 4   2238KiB/s   1381KiB/s   7703KiB/s   532K usec   3195KiB/s   546K usec   
> > 8   504KiB/s    46KiB/s     1439KiB/s   399K usec   7661KiB/s   220K usec   
> > 16  131KiB/s    26KiB/s     638KiB/s    492K usec   4847KiB/s   359K usec   
> > 
> > With vanilla CFQ
> > ================
> > <------------prio4 readers---------------------->  <---prio0 reader------>
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   10779KiB/s  10779KiB/s  10779KiB/s  407K usec   16094KiB/s  808K usec   
> > 2   7045KiB/s   6913KiB/s   13959KiB/s  538K usec   18794KiB/s  761K usec   
> > 4   7842KiB/s   4409KiB/s   20967KiB/s  876K usec   12543KiB/s  443K usec   
> > 8   6198KiB/s   2426KiB/s   24219KiB/s  1469K usec  9483KiB/s   685K usec   
> > 16  5041KiB/s   1358KiB/s   27022KiB/s  2417K usec  6211KiB/s   1025K usec  
> > 
> > 
> > The above results show how bandwidth got distributed between the prio4 and
> > prio0 readers within the group as we increased the number of prio4 readers
> > in the group. In the other group a buffered writer is continuously running
> > as a competitor.
> > 
> > Notice, with dm-ioband how bandwidth allocation is broken.
> > 
> > With 1 prio4 reader, the prio4 reader got more bandwidth than the prio0 reader.
> > 
> > With 2 prio4 readers, it looks like prio4 got almost the same BW as prio0.
> > 
> > With 8 and 16 prio4 readers, it looks like the prio0 reader takes over and
> > the prio4 readers starve.
> > 
> > As we increase the number of prio4 readers in the group, their total
> > aggregate BW share should increase. Instead it is decreasing.
> > 
> > So to me, in the face of competition with a writer in the other group, BW
> > is all over the place. Some of these might be dm-ioband bugs and some of
> > these might be coming from the fact that buffering takes place in a higher
> > layer and dispatch is FIFO?
> 
> Thank you for testing. I did the same test and here are the results.
> 
> with vanilla CFQ
>    <------------prio4 readers------------------>   prio0       group2
>       maxbw       minbw      aggrbw     maxlat     aggrbw      bufwrite
>  1 12,140KiB/s 12,140KiB/s 12,140KiB/s 30001msec 11,125KiB/s  1,923KiB/s
>  2  3,967KiB/s  3,930KiB/s  7,897KiB/s 30001msec 14,213KiB/s  1,586KiB/s
>  4  3,399KiB/s  3,066KiB/s 13,031KiB/s 30082msec  8,930KiB/s  1,296KiB/s
>  8  2,086KiB/s  1,720KiB/s 15,266KiB/s 30003msec  7,546KiB/s    517KiB/s
> 16  1,156KiB/s    837KiB/s 15,377KiB/s 30033msec  4,282KiB/s    600KiB/s
> 
> with dm-ioband weight-iosize policy
>    <------------prio4 readers------------------>   prio0       group2
>       maxbw       minbw      aggrbw     maxlat     aggrbw      bufwrite
>  1    107KiB/s    107KiB/s    107KiB/s 30007msec 12,242KiB/s 12,320KiB/s
>  2  1,259KiB/s    702KiB/s  1,961KiB/s 30037msec  9,657KiB/s 11,657KiB/s
>  4  2,705KiB/s     29KiB/s  5,186KiB/s 30026msec  5,927KiB/s 11,300KiB/s
>  8  2,428KiB/s     27KiB/s  5,629KiB/s 30054msec  5,057KiB/s 10,704KiB/s
> 16  2,465KiB/s     23KiB/s  4,309KiB/s 30032msec  4,750KiB/s  9,088KiB/s
> 
> The results are somewhat different from yours. The bandwidth is
> distributed to each group equally, but CFQ priority is broken as you
> said. I think that the reason is not because of FIFO, but because
> some IO requests are issued from dm-ioband's kernel thread on behalf of
> processes which originate the IO requests, then CFQ assumes that the
> kernel thread is the originator and uses its io_context.

Ok. Our numbers can vary a bit depending on fio settings like block size,
and also on the underlying storage. But that's not the important thing. With
this test I just wanted to point out that the model of ioprio within a group
is currently broken with dm-ioband, and it is good that you can reproduce that.

One minor nit: for max latency you need to look at the "clat" row and the
"max=" field in the fio output. Most of the time "max latency" will matter
most. You seem to be currently grepping for "maxt", which just seems to tell
how long the test ran, in this case 30 seconds.
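For illustration, a pipeline along these lines pulls the "max=" value from the
clat row instead; the sample line below is fabricated to resemble fio output,
whose exact format varies by fio version:

```shell
# Fabricated sample resembling a fio completion-latency line; the real
# format differs across fio versions, so adjust the pattern to your output.
sample='     clat (msec): min=1, max=2417, avg=804.12, stdev=120.50'
# Take the first "max=" field (completion latency), not the "maxt" runtime.
max_clat=$(printf '%s\n' "$sample" | grep -o 'max=[0-9]*' | head -n 1 | cut -d= -f2)
echo "max clat: ${max_clat} msec"
```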

Assigning reads to the right context in CFQ, and not to the dm-ioband thread,
might help a bit, but I am a bit skeptical, and the following is the reason.

CFQ relies on time, providing a longer time slice to a higher priority
process, and if a process does not use its time slice, it loses its share. So
the moment you buffer even a single bio of a process in the dm layer, while
CFQ is servicing that process, that process will lose its share. CFQ will
anticipate for at most 8 ms, and if buffering takes longer than 8 ms, CFQ
will expire the queue and move on to the next queue. If you later submit the
same bio via the dm-ioband helper thread, even if CFQ attributes it to the
right process, it is not going to help much, as the process has already lost
its slice and a new slice will start.

> 
> > > Here is my test script.
> > > -------------------------------------------------------------------------
> > > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> > >      --group_reporting"
> > > 
> > > sync
> > > echo 3 > /proc/sys/vm/drop_caches
> > > 
> > > echo $$ > /cgroup/1/tasks
> > > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> > > echo $$ > /cgroup/2/tasks
> > > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> > > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> > > echo $$ > /cgroup/tasks
> > > wait
> > > -------------------------------------------------------------------------
> > > 
> > > Be that as it may, I think that if every bio can point to the iocontext
> > > of the process that issued it, it becomes possible to handle IO priority
> > > in the higher level controller. A patchset has already been posted by
> > > Takahashi-san. What do you think about this idea?
> > > 
> > >   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > >   Subject [RFC][PATCH 1/10] I/O context inheritance
> > >   From Hirokazu Takahashi <>
> > >   http://lkml.org/lkml/2008/4/22/195
> > 
> > So far you have been denying that there are issues with ioprio within a
> > group in the higher level controller. Here you seem to be saying that
> > there are issues with ioprio and we need to take this patch in to solve
> > them? I am confused.
> 
> The true intention of this patch is to preserve the io-context of the
> process which originates the IO, but I think that we could also make use of
> this patch as one way to solve this issue.
> 

Ok. Did you run the same test with this patch applied, and what do the
numbers look like? Can you please forward port it to 2.6.31? I would also
like to play with it.

I am running more tests/numbers with 2.6.31 for all the IO controllers and
planning to post it to lkml before we meet for IO mini summit. Numbers can
help us understand the issue better.

In the first phase I am planning to post numbers for the IO scheduler
controller and dm-ioband. Then I will get to the max bw controller of
Andrea Righi.

> > Anyway, if you think that the above patch is needed to solve the issue of
> > ioprio in the higher level controller, why are you not posting it as part
> > of your patch series regularly, so that we can also apply this patch along
> > with the other patches and test the effects?
> 
> I will post the patch, but I would like to find out and understand the
> reason of above test results before posting the patch.
> 

Ok. So in the mean time, I will continue to do testing with dm-ioband
version 1.14.0 and post the numbers.

> > Against which kernel version do the above patches apply? I tried the
> > biocgroup patches against 2.6.31 as well as 2.6.32-rc1 and they do not
> > apply cleanly against either.
> > 
> > So for the time being I am doing testing with biocgroup patches.
> 
> I created those patches against 2.6.32-rc1 and made sure the patches
> can be cleanly applied to that version.

I am applying the dm-ioband patch first and then the bio cgroup patches. Is
this the right order? I will try again.
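A --dry-run pass makes it easy to confirm the order before touching the tree;
the one-file tree and patch below are fabricated purely to show the pattern:

```shell
# Fabricated one-file tree plus patch, only to demonstrate dry-run-then-apply.
mkdir -p /tmp/patchdemo/tree && cd /tmp/patchdemo/tree
printf 'line1\n' > file.txt
cat > ../first.patch <<'EOF'
--- a/file.txt
+++ b/file.txt
@@ -1 +1,2 @@
 line1
+line2
EOF
# Try each patch in order with --dry-run first, so a misordered or stale
# patch fails before anything is applied to the tree.
for p in ../first.patch; do
    patch -p1 --dry-run -s < "$p" && patch -p1 -s < "$p" \
        || echo "does not apply cleanly: $p"
done
cat file.txt    # now contains line1 and line2
```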

Anyway, I don't have too much time before the IO mini summit, so I will stick
to 2.6.31 for the time being. If time permits, I will venture into 32-rc1 as
well.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-10-07 15:09                                       ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-07 15:09 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew,
	yoshikawa.takuya, jmoyer, nauman, mingo, righi.andrea, riel,
	lizf, fchecconi, s-uchida, containers, linux-kernel, akpm,
	m-ikeda, torvalds

On Wed, Oct 07, 2009 at 11:38:05PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > >> If one would like to
> > > > >> combine some physical disks into one logical device like a dm-linear,
> > > > >> I think one should map the IO controller on each physical device and
> > > > >> combine them into one logical device.
> > > > >>
> > > > >
> > > > > In fact this sounds like a more complicated step where one has to setup
> > > > > one dm-ioband device on top of each physical device. But I am assuming
> > > > > that this will go away once you move to per reuqest queue like implementation.
> > > 
> > > I don't understand why the per request queue implementation makes it
> > > go away. If dm-ioband is integrated into the LVM tools, it could allow
> > > users to skip the complicated steps to configure dm-linear devices.
> > > 
> > 
> > Those who are not using dm-tools will be forced to use dm-tools for
> > bandwidth control features.
> 
> If once dm-ioband is integrated into the LVM tools and bandwidth can
> be assigned per device by lvcreate, the use of dm-tools is no longer
> required for users.

But it is same thing. Now LVM tools is mandatory to use?

> 
> > Interesting. In all the test cases you always test with sequential
> > readers. I have changed the test case a bit (I have already reported the
> > results in another mail, now running the same test again with dm-version
> > 1.14). I made all the readers doing direct IO and in other group I put
> > a buffered writer. So setup looks as follows.
> > 
> > In group1, I launch 1 prio 0 reader and increasing number of prio4
> > readers. In group 2 I just run a dd doing buffered writes. Weights of
> > both the groups are 100 each.
> > 
> > Following are the results on 2.6.31 kernel.
> > 
> > With-dm-ioband
> > ==============
> > <------------prio4 readers---------------------->  <---prio0 reader------>
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   9992KiB/s   9992KiB/s   9992KiB/s   413K usec   4621KiB/s   369K usec   
> > 2   4859KiB/s   4265KiB/s   9122KiB/s   344K usec   4915KiB/s   401K usec   
> > 4   2238KiB/s   1381KiB/s   7703KiB/s   532K usec   3195KiB/s   546K usec   
> > 8   504KiB/s    46KiB/s     1439KiB/s   399K usec   7661KiB/s   220K usec   
> > 16  131KiB/s    26KiB/s     638KiB/s    492K usec   4847KiB/s   359K usec   
> > 
> > With vanilla CFQ
> > ================
> > <------------prio4 readers---------------------->  <---prio0 reader------>
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   10779KiB/s  10779KiB/s  10779KiB/s  407K usec   16094KiB/s  808K usec   
> > 2   7045KiB/s   6913KiB/s   13959KiB/s  538K usec   18794KiB/s  761K usec   
> > 4   7842KiB/s   4409KiB/s   20967KiB/s  876K usec   12543KiB/s  443K usec   
> > 8   6198KiB/s   2426KiB/s   24219KiB/s  1469K usec  9483KiB/s   685K usec   
> > 16  5041KiB/s   1358KiB/s   27022KiB/s  2417K usec  6211KiB/s   1025K usec  
> > 
> > 
> > The above results show how bandwidth got distributed between the prio4 and
> > prio0 readers within the group as we increased the number of prio4 readers.
> > In another group a buffered writer is continuously running as a
> > competitor.
> > 
> > Notice how bandwidth allocation is broken with dm-ioband.
> > 
> > With 1 prio4 reader, the prio4 reader got more bandwidth than the prio0 reader.
> > 
> > With 2 prio4 readers, prio4 got almost the same BW as prio0.
> > 
> > With 8 and 16 prio4 readers, the prio0 reader takes over and the prio4
> > readers starve.
> > 
> > As we increase the number of prio4 readers in the group, their total aggregate
> > BW share should increase. Instead it is decreasing.
> > 
> > So to me, in the face of competition with a writer in the other group, BW is
> > all over the place. Some of this might be dm-ioband bugs and some might
> > come from the fact that buffering takes place in a higher layer and
> > dispatch is FIFO?
> 
> Thank you for testing. I did the same test and here are the results.
> 
> with vanilla CFQ
>    <------------prio4 readers------------------>   prio0       group2
>       maxbw       minbw      aggrbw     maxlat     aggrbw      bufwrite
>  1 12,140KiB/s 12,140KiB/s 12,140KiB/s 30001msec 11,125KiB/s  1,923KiB/s
>  2  3,967KiB/s  3,930KiB/s  7,897KiB/s 30001msec 14,213KiB/s  1,586KiB/s
>  4  3,399KiB/s  3,066KiB/s 13,031KiB/s 30082msec  8,930KiB/s  1,296KiB/s
>  8  2,086KiB/s  1,720KiB/s 15,266KiB/s 30003msec  7,546KiB/s    517KiB/s
> 16  1,156KiB/s    837KiB/s 15,377KiB/s 30033msec  4,282KiB/s    600KiB/s
> 
> with dm-ioband weight-iosize policy
>    <------------prio4 readers------------------>   prio0       group2
>       maxbw       minbw      aggrbw     maxlat     aggrbw      bufwrite
>  1    107KiB/s    107KiB/s    107KiB/s 30007msec 12,242KiB/s 12,320KiB/s
>  2  1,259KiB/s    702KiB/s  1,961KiB/s 30037msec  9,657KiB/s 11,657KiB/s
>  4  2,705KiB/s     29KiB/s  5,186KiB/s 30026msec  5,927KiB/s 11,300KiB/s
>  8  2,428KiB/s     27KiB/s  5,629KiB/s 30054msec  5,057KiB/s 10,704KiB/s
> 16  2,465KiB/s     23KiB/s  4,309KiB/s 30032msec  4,750KiB/s  9,088KiB/s
> 
> The results are somewhat different from yours. The bandwidth is
> distributed to each group equally, but CFQ priority is broken as you
> said. I think that the reason is not FIFO dispatch, but that
> some IO requests are issued from dm-ioband's kernel thread on behalf of
> the processes which originate the IO requests, so CFQ assumes that the
> kernel thread is the originator and uses its io_context.

Ok. Our numbers can vary a bit depending on fio settings like block size
and the underlying storage. But that's not the important thing. With this
test I just wanted to point out that the model of ioprio within a group is
currently broken with dm-ioband, and it is good that you can reproduce that.

One minor nit: for max latency you need to look at the "clat" row and the
"max=" field in the fio output. Most of the time "max latency" will matter
most. You currently seem to be grepping for "maxt", which just tells how
long the test ran, in this case 30 seconds.
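
As a side note, pulling the "max=" value out of the "clat" row can be scripted. This is just a sketch, assuming typical fio 1.x-style output where completion latency is reported as "clat (usec): min=..., max=..."; the sample line below is hypothetical:

```python
import re

# Hypothetical fio "clat" row (format assumed from typical fio 1.x output)
line = "    clat (usec): min=102, max=2417321, avg=35012.44, stdev=10233.12"

def max_clat_usec(fio_line):
    """Return the max completion latency in usec from a fio 'clat' row."""
    m = re.search(r"clat \(usec\).*?max=(\d+)", fio_line)
    return int(m.group(1)) if m else None

print(max_clat_usec(line))  # -> 2417321
```

Grepping "maxt" instead would match the total runtime line, which is why the latency columns all read ~30000 msec.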

Assigning reads to the right context in CFQ, instead of to the dm-ioband
thread, might help a bit, but I am skeptical, for the following reason.

CFQ relies on time, providing a longer time slice for a higher priority
process, and if a process does not use its time slice, it loses its share.
So the moment you buffer even a single bio of a process in the dm layer,
while CFQ is servicing that process, the process will lose its share. CFQ
will anticipate for at most 8 ms, and if the buffering takes longer than
8 ms, CFQ will expire the queue and move on to the next one. If you later
submit the same bio via the dm-ioband helper thread, then even if CFQ
attributes it to the right process, it will not help much, as the process
has already lost its slice and a new slice will start.
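
The timing argument above can be reduced to a toy model (not kernel code; the 8 ms figure is CFQ's anticipation window mentioned above, everything else is a deliberate simplification for illustration):

```python
IDLE_WINDOW_MS = 8  # CFQ's maximum anticipation window per queue

def slice_survives(buffering_delay_ms):
    """Return True if the process's CFQ slice survives the dm-layer buffering.

    If the next bio arrives within the idle window, CFQ keeps idling and
    the slice is preserved; otherwise CFQ expires the queue and moves on,
    and later reattributing the bio to the right io_context cannot restore
    the time already forfeited.
    """
    return buffering_delay_ms <= IDLE_WINDOW_MS

print(slice_survives(2))   # short buffering: slice preserved
print(slice_survives(20))  # long buffering: queue expired, share lost
```

So fixing the io_context attribution alone would not be sufficient whenever the dm layer holds bios longer than the idle window.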

> 
> > > Here is my test script.
> > > -------------------------------------------------------------------------
> > > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> > >      --group_reporting"
> > > 
> > > sync
> > > echo 3 > /proc/sys/vm/drop_caches
> > > 
> > > echo $$ > /cgroup/1/tasks
> > > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> > > echo $$ > /cgroup/2/tasks
> > > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> > > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> > > echo $$ > /cgroup/tasks
> > > wait
> > > -------------------------------------------------------------------------
> > > 
> > > Be that as it may, I think that if every bio can point to the io_context
> > > of the process, then it becomes possible to handle IO priority in the
> > > higher level controller. A patch set has already been posted by Takahashi-san.
> > > What do you think about this idea?
> > > 
> > >   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > >   Subject [RFC][PATCH 1/10] I/O context inheritance
> > >   From Hirokazu Takahashi <>
> > >   http://lkml.org/lkml/2008/4/22/195
> > 
> > So far you have been denying that there are issues with ioprio within a
> > group in a higher level controller. Here you seem to be saying that there are
> > issues with ioprio and we need to take this patch in to solve the issue? I am
> > confused?
> 
> The true intention of this patch is to preserve the io-context of the
> process which originated the IO, but I think that we could also make use of
> this patch as one way to solve this issue.
> 

Ok. Did you run the same test with this patch applied, and what do the
numbers look like? Can you please forward-port it to 2.6.31? I would also
like to play with it.

I am running more tests/numbers with 2.6.31 for all the IO controllers and
planning to post them to lkml before we meet for the IO mini summit. Numbers
can help us understand the issue better.

In the first phase I am planning to post numbers for the IO scheduler
controller and dm-ioband. Then I will get to the max-bw controller of
Andrea Righi.

> > Anyway, if you think that above patch is needed to solve the issue of
> > ioprio in higher level controller, why are you not posting it as part of
> > your patch series regularly, so that we can also apply this patch along
> > with other patches and test the effects?
> 
> I will post the patch, but I would like to find out and understand the
> reason for the above test results before posting it.
> 

Ok. So in the mean time, I will continue to do testing with dm-ioband
version 1.14.0 and post the numbers.

> > Against what kernel version do the above patches apply? I tried the
> > biocgroup patches against 2.6.31 as well as 2.6.32-rc1 and they do not
> > apply cleanly against either.
> > 
> > So for the time being I am doing testing with biocgroup patches.
> 
> I created those patches against 2.6.32-rc1 and made sure the patches
> can be cleanly applied to that version.

I am applying the dm-ioband patch first and then the bio-cgroup patches. Is
this the right order? I will try again.

Anyway, I don't have much time before the IO mini summit, so I will stick to
2.6.31 for the time being. If time permits, I will venture into 32-rc1 also.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                   ` <20091006112201.GA27866-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-10-07 14:38                                     ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-07 14:38 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,

Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > >> If one would like to
> > > >> combine some physical disks into one logical device like a dm-linear,
> > > >> I think one should map the IO controller on each physical device and
> > > >> combine them into one logical device.
> > > >>
> > > >
> > > > In fact this sounds like a more complicated step where one has to set up
> > > > one dm-ioband device on top of each physical device. But I am assuming
> > > > that this will go away once you move to a per-request-queue implementation.
> > 
> > I don't understand why the per request queue implementation makes it
> > go away. If dm-ioband is integrated into the LVM tools, it could allow
> > users to skip the complicated steps to configure dm-linear devices.
> > 
> 
> Those who are not using dm-tools will be forced to use dm-tools for
> bandwidth control features.

Once dm-ioband is integrated into the LVM tools and bandwidth can
be assigned per device by lvcreate, the use of dm-tools will no longer
be required for users.

> Interesting. In all the test cases you always test with sequential
> readers. I have changed the test case a bit (I have already reported the
> results in another mail, now running the same test again with dm-version
> 1.14). I made all the readers do direct IO, and in the other group I put
> a buffered writer. So the setup looks as follows.
> 
> In group1, I launch 1 prio0 reader and an increasing number of prio4
> readers. In group2 I just run a dd doing buffered writes. The weights of
> both groups are 100 each.
> 
> Following are the results on 2.6.31 kernel.
> 
> With-dm-ioband
> ==============
> <------------prio4 readers---------------------->  <---prio0 reader------>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   9992KiB/s   9992KiB/s   9992KiB/s   413K usec   4621KiB/s   369K usec   
> 2   4859KiB/s   4265KiB/s   9122KiB/s   344K usec   4915KiB/s   401K usec   
> 4   2238KiB/s   1381KiB/s   7703KiB/s   532K usec   3195KiB/s   546K usec   
> 8   504KiB/s    46KiB/s     1439KiB/s   399K usec   7661KiB/s   220K usec   
> 16  131KiB/s    26KiB/s     638KiB/s    492K usec   4847KiB/s   359K usec   
> 
> With vanilla CFQ
> ================
> <------------prio4 readers---------------------->  <---prio0 reader------>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   10779KiB/s  10779KiB/s  10779KiB/s  407K usec   16094KiB/s  808K usec   
> 2   7045KiB/s   6913KiB/s   13959KiB/s  538K usec   18794KiB/s  761K usec   
> 4   7842KiB/s   4409KiB/s   20967KiB/s  876K usec   12543KiB/s  443K usec   
> 8   6198KiB/s   2426KiB/s   24219KiB/s  1469K usec  9483KiB/s   685K usec   
> 16  5041KiB/s   1358KiB/s   27022KiB/s  2417K usec  6211KiB/s   1025K usec  
> 
> 
> The above results show how bandwidth got distributed between the prio4 and
> prio0 readers within the group as we increased the number of prio4 readers.
> In another group a buffered writer is continuously running as a
> competitor.
> 
> Notice how bandwidth allocation is broken with dm-ioband.
> 
> With 1 prio4 reader, the prio4 reader got more bandwidth than the prio0 reader.
> 
> With 2 prio4 readers, prio4 got almost the same BW as prio0.
> 
> With 8 and 16 prio4 readers, the prio0 reader takes over and the prio4
> readers starve.
> 
> As we increase the number of prio4 readers in the group, their total aggregate
> BW share should increase. Instead it is decreasing.
> 
> So to me, in the face of competition with a writer in the other group, BW is
> all over the place. Some of this might be dm-ioband bugs and some might
> come from the fact that buffering takes place in a higher layer and
> dispatch is FIFO?

Thank you for testing. I did the same test and here are the results.

with vanilla CFQ
   <------------prio4 readers------------------>   prio0       group2
      maxbw       minbw      aggrbw     maxlat     aggrbw      bufwrite
 1 12,140KiB/s 12,140KiB/s 12,140KiB/s 30001msec 11,125KiB/s  1,923KiB/s
 2  3,967KiB/s  3,930KiB/s  7,897KiB/s 30001msec 14,213KiB/s  1,586KiB/s
 4  3,399KiB/s  3,066KiB/s 13,031KiB/s 30082msec  8,930KiB/s  1,296KiB/s
 8  2,086KiB/s  1,720KiB/s 15,266KiB/s 30003msec  7,546KiB/s    517KiB/s
16  1,156KiB/s    837KiB/s 15,377KiB/s 30033msec  4,282KiB/s    600KiB/s

with dm-ioband weight-iosize policy
   <------------prio4 readers------------------>   prio0       group2
      maxbw       minbw      aggrbw     maxlat     aggrbw      bufwrite
 1    107KiB/s    107KiB/s    107KiB/s 30007msec 12,242KiB/s 12,320KiB/s
 2  1,259KiB/s    702KiB/s  1,961KiB/s 30037msec  9,657KiB/s 11,657KiB/s
 4  2,705KiB/s     29KiB/s  5,186KiB/s 30026msec  5,927KiB/s 11,300KiB/s
 8  2,428KiB/s     27KiB/s  5,629KiB/s 30054msec  5,057KiB/s 10,704KiB/s
16  2,465KiB/s     23KiB/s  4,309KiB/s 30032msec  4,750KiB/s  9,088KiB/s

The results are somewhat different from yours. The bandwidth is
distributed to each group equally, but CFQ priority is broken as you
said. I think that the reason is not FIFO dispatch, but that
some IO requests are issued from dm-ioband's kernel thread on behalf of
the processes which originate the IO requests, so CFQ assumes that the
kernel thread is the originator and uses its io_context.
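
That misattribution can be sketched with a toy accounting model (purely illustrative; the names are made up and this is not how CFQ is actually structured internally):

```python
from collections import Counter

def charge_requests(originators, resubmit_via_helper=False):
    """Tally which io_context a CFQ-like scheduler charges each request to.

    CFQ looks at the submitting task ("current"), so when a dm-ioband-style
    helper thread resubmits the bios, every request gets charged to the
    helper's io_context instead of the originator's, and per-process
    ioprio distinctions collapse.
    """
    charged = Counter()
    for task in originators:
        submitter = "helper-thread" if resubmit_via_helper else task
        charged[submitter] += 1
    return charged

reqs = ["prio0-reader", "prio4-reader", "prio4-reader"]
print(charge_requests(reqs))                            # charged per originator
print(charge_requests(reqs, resubmit_via_helper=True))  # all charged to helper
```

With everything charged to one io_context, CFQ has no basis left for prioritizing the prio0 reader over the prio4 readers.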

> > Here is my test script.
> > -------------------------------------------------------------------------
> > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> >      --group_reporting"
> > 
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> > 
> > echo $$ > /cgroup/1/tasks
> > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> > echo $$ > /cgroup/2/tasks
> > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> > echo $$ > /cgroup/tasks
> > wait
> > -------------------------------------------------------------------------
> > 
> > Be that as it may, I think that if every bio can point to the io_context
> > of the process, then it becomes possible to handle IO priority in the
> > higher level controller. A patch set has already been posted by Takahashi-san.
> > What do you think about this idea?
> > 
> >   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> >   Subject [RFC][PATCH 1/10] I/O context inheritance
> >   From Hirokazu Takahashi <>
> >   http://lkml.org/lkml/2008/4/22/195
> 
> So far you have been denying that there are issues with ioprio within a
> group in a higher level controller. Here you seem to be saying that there are
> issues with ioprio and we need to take this patch in to solve the issue? I am
> confused?

The true intention of this patch is to preserve the io-context of the
process which originated the IO, but I think that we could also make use of
this patch as one way to solve this issue.

> Anyway, if you think that above patch is needed to solve the issue of
> ioprio in higher level controller, why are you not posting it as part of
> your patch series regularly, so that we can also apply this patch along
> with other patches and test the effects?

I will post the patch, but I would like to find out and understand the
reason of above test results before posting the patch.

> Against what kernel version do the above patches apply? I tried the
> biocgroup patches against 2.6.31 as well as 2.6.32-rc1 and they do not
> apply cleanly against either.
> 
> So for the time being I am doing testing with biocgroup patches.

I created those patches against 2.6.32-rc1 and made sure the patches
can be cleanly applied to that version.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-06 11:22                                   ` Vivek Goyal
@ 2009-10-07 14:38                                     ` Ryo Tsuruta
  -1 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-07 14:38 UTC (permalink / raw)
  To: vgoyal
  Cc: nauman, m-ikeda, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, agk, akpm, peterz, jmarchan, torvalds, mingo, riel,
	yoshikawa.takuya

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >> If one would like to
> > > >> combine some physical disks into one logical device like a dm-linear,
> > > >> I think one should map the IO controller on each physical device and
> > > >> combine them into one logical device.
> > > >>
> > > >
> > > > In fact this sounds like a more complicated step where one has to setup
> > > > one dm-ioband device on top of each physical device. But I am assuming
> > > > that this will go away once you move to per reuqest queue like implementation.
> > 
> > I don't understand why the per request queue implementation makes it
> > go away. If dm-ioband is integrated into the LVM tools, it could allow
> > users to skip the complicated steps to configure dm-linear devices.
> > 
> 
> Those who are not using dm-tools will be forced to use dm-tools for
> bandwidth control features.

If once dm-ioband is integrated into the LVM tools and bandwidth can
be assigned per device by lvcreate, the use of dm-tools is no longer
required for users.

> Interesting. In all the test cases you always test with sequential
> readers. I have changed the test case a bit (I have already reported the
> results in another mail, now running the same test again with dm-version
> 1.14). I made all the readers doing direct IO and in other group I put
> a buffered writer. So setup looks as follows.
> 
> In group1, I launch 1 prio 0 reader and increasing number of prio4
> readers. In group 2 I just run a dd doing buffered writes. Weights of
> both the groups are 100 each.
> 
> Following are the results on 2.6.31 kernel.
> 
> With-dm-ioband
> ==============
> <------------prio4 readers---------------------->  <---prio0 reader------>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   9992KiB/s   9992KiB/s   9992KiB/s   413K usec   4621KiB/s   369K usec   
> 2   4859KiB/s   4265KiB/s   9122KiB/s   344K usec   4915KiB/s   401K usec   
> 4   2238KiB/s   1381KiB/s   7703KiB/s   532K usec   3195KiB/s   546K usec   
> 8   504KiB/s    46KiB/s     1439KiB/s   399K usec   7661KiB/s   220K usec   
> 16  131KiB/s    26KiB/s     638KiB/s    492K usec   4847KiB/s   359K usec   
> 
> With vanilla CFQ
> ================
> <------------prio4 readers---------------------->  <---prio0 reader------>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   10779KiB/s  10779KiB/s  10779KiB/s  407K usec   16094KiB/s  808K usec   
> 2   7045KiB/s   6913KiB/s   13959KiB/s  538K usec   18794KiB/s  761K usec   
> 4   7842KiB/s   4409KiB/s   20967KiB/s  876K usec   12543KiB/s  443K usec   
> 8   6198KiB/s   2426KiB/s   24219KiB/s  1469K usec  9483KiB/s   685K usec   
> 16  5041KiB/s   1358KiB/s   27022KiB/s  2417K usec  6211KiB/s   1025K usec  
> 
> 
> Above results are showing how bandwidth got distributed between prio4 and
> prio1 readers with-in group as we increased number of prio4 readers in
> the group. In another group a buffered writer is continuously going on
> as competitor.
> 
> Notice, with dm-ioband how bandwidth allocation is broken.
> 
> With 1 prio4 reader, prio4 reader got more bandwidth than prio1 reader.
> 
> With 2 prio4 readers, looks like prio4 got almost same BW as prio1.
> 
> With 8 and 16 prio4 readers, looks like prio0 readers takes over and prio4
> readers starve.
> 
> As we incresae number of prio4 readers in the group, their total aggregate
> BW share should increase. Instread it is decreasing.
> 
> So to me in the face of competition with a writer in other group, BW is
> all over the place. Some of these might be dm-ioband bugs and some of
> these might be coming from the fact that buffering takes place in higher
> layer and dispatch is FIFO?

Thank you for testing. I did the same test and here are the results.

with vanilla CFQ
   <------------prio4 readers------------------>   prio0       group2
      maxbw       minbw      aggrbw     maxlat     aggrbw      bufwrite
 1 12,140KiB/s 12,140KiB/s 12,140KiB/s 30001msec 11,125KiB/s  1,923KiB/s
 2  3,967KiB/s  3,930KiB/s  7,897KiB/s 30001msec 14,213KiB/s  1,586KiB/s
 4  3,399KiB/s  3,066KiB/s 13,031KiB/s 30082msec  8,930KiB/s  1,296KiB/s
 8  2,086KiB/s  1,720KiB/s 15,266KiB/s 30003msec  7,546KiB/s    517KiB/s
16  1,156KiB/s    837KiB/s 15,377KiB/s 30033msec  4,282KiB/s    600KiB/s

with dm-ioband weight-iosize policy
   <------------prio4 readers------------------>   prio0       group2
      maxbw       minbw      aggrbw     maxlat     aggrbw      bufwrite
 1    107KiB/s    107KiB/s    107KiB/s 30007msec 12,242KiB/s 12,320KiB/s
 2  1,259KiB/s    702KiB/s  1,961KiB/s 30037msec  9,657KiB/s 11,657KiB/s
 4  2,705KiB/s     29KiB/s  5,186KiB/s 30026msec  5,927KiB/s 11,300KiB/s
 8  2,428KiB/s     27KiB/s  5,629KiB/s 30054msec  5,057KiB/s 10,704KiB/s
16  2,465KiB/s     23KiB/s  4,309KiB/s 30032msec  4,750KiB/s  9,088KiB/s

The results are somewhat different from yours. The bandwidth is
distributed to each group equally, but CFQ priority is broken as you
said. I think that the reason is not because of FIFO, but because
some IO requests are issued from dm-ioband's kernel thread on behalf of
processes which origirante the IO requests, then CFQ assumes that the
kernel thread is the originator and uses its io_context.

> > Here is my test script.
> > -------------------------------------------------------------------------
> > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> >      --group_reporting"
> > 
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> > 
> > echo $$ > /cgroup/1/tasks
> > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> > echo $$ > /cgroup/2/tasks
> > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> > echo $$ > /cgroup/tasks
> > wait
> > -------------------------------------------------------------------------
> > 
> > Be that as it way, I think that if every bio can point the iocontext
> > of the process, then it makes it possible to handle IO priority in the
> > higher level controller. A patchse has already posted by Takhashi-san.
> > What do you think about this idea?
> > 
> >   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> >   Subject [RFC][PATCH 1/10] I/O context inheritance
> >   From Hirokazu Takahashi <>
> >   http://lkml.org/lkml/2008/4/22/195
> 
> So far you have been denying that there are issues with ioprio with-in
> group in higher level controller. Here you seems to be saying that there are
> issues with ioprio and we need to take this patch in to solve the issue? I am
> confused?

The true intention of this patch is to preserve the io-context of a
process which originate it, but I think that we could also make use of
this patch for one of the way to solve this issue.

> Anyway, if you think that above patch is needed to solve the issue of
> ioprio in higher level controller, why are you not posting it as part of
> your patch series regularly, so that we can also apply this patch along
> with other patches and test the effects?

I will post the patch, but I would like to find out and understand the
reason of above test results before posting the patch.

> Against what kernel version above patches apply. The biocgroup patches
> I tried against 2.6.31 as well as 2.6.32-rc1 and it does not apply cleanly
> against any of these?
> 
> So for the time being I am doing testing with biocgroup patches.

I created those patches against 2.6.32-rc1 and made sure the patches
can be cleanly applied to that version.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-10-07 14:38                                     ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-07 14:38 UTC (permalink / raw)
  To: vgoyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew,
	yoshikawa.takuya, jmoyer, nauman, mingo, righi.andrea, riel,
	lizf, fchecconi, s-uchida, containers, linux-kernel, akpm,
	m-ikeda, torvalds

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >> If one would like to
> > > >> combine some physical disks into one logical device like a dm-linear,
> > > >> I think one should map the IO controller on each physical device and
> > > >> combine them into one logical device.
> > > >>
> > > >
> > > > In fact this sounds like a more complicated step where one has to setup
> > > > one dm-ioband device on top of each physical device. But I am assuming
> > > > that this will go away once you move to per reuqest queue like implementation.
> > 
> > I don't understand why the per request queue implementation makes it
> > go away. If dm-ioband is integrated into the LVM tools, it could allow
> > users to skip the complicated steps to configure dm-linear devices.
> > 
> 
> Those who are not using dm-tools will be forced to use dm-tools for
> bandwidth control features.

If once dm-ioband is integrated into the LVM tools and bandwidth can
be assigned per device by lvcreate, the use of dm-tools is no longer
required for users.

> Interesting. In all the test cases you always test with sequential
> readers. I have changed the test case a bit (I have already reported the
> results in another mail, now running the same test again with dm-version
> 1.14). I made all the readers doing direct IO and in other group I put
> a buffered writer. So setup looks as follows.
> 
> In group1, I launch 1 prio 0 reader and increasing number of prio4
> readers. In group 2 I just run a dd doing buffered writes. Weights of
> both the groups are 100 each.
> 
> Following are the results on 2.6.31 kernel.
> 
> With-dm-ioband
> ==============
> <------------prio4 readers---------------------->  <---prio0 reader------>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   9992KiB/s   9992KiB/s   9992KiB/s   413K usec   4621KiB/s   369K usec   
> 2   4859KiB/s   4265KiB/s   9122KiB/s   344K usec   4915KiB/s   401K usec   
> 4   2238KiB/s   1381KiB/s   7703KiB/s   532K usec   3195KiB/s   546K usec   
> 8   504KiB/s    46KiB/s     1439KiB/s   399K usec   7661KiB/s   220K usec   
> 16  131KiB/s    26KiB/s     638KiB/s    492K usec   4847KiB/s   359K usec   
> 
> With vanilla CFQ
> ================
> <------------prio4 readers---------------------->  <---prio0 reader------>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   10779KiB/s  10779KiB/s  10779KiB/s  407K usec   16094KiB/s  808K usec   
> 2   7045KiB/s   6913KiB/s   13959KiB/s  538K usec   18794KiB/s  761K usec   
> 4   7842KiB/s   4409KiB/s   20967KiB/s  876K usec   12543KiB/s  443K usec   
> 8   6198KiB/s   2426KiB/s   24219KiB/s  1469K usec  9483KiB/s   685K usec   
> 16  5041KiB/s   1358KiB/s   27022KiB/s  2417K usec  6211KiB/s   1025K usec  
> 
> 
> Above results are showing how bandwidth got distributed between prio4 and
> prio1 readers with-in group as we increased number of prio4 readers in
> the group. In another group a buffered writer is continuously going on
> as competitor.
> 
> Notice, with dm-ioband how bandwidth allocation is broken.
> 
> With 1 prio4 reader, prio4 reader got more bandwidth than prio1 reader.
> 
> With 2 prio4 readers, looks like prio4 got almost same BW as prio1.
> 
> With 8 and 16 prio4 readers, looks like prio0 readers takes over and prio4
> readers starve.
> 
> As we incresae number of prio4 readers in the group, their total aggregate
> BW share should increase. Instread it is decreasing.
> 
> So to me in the face of competition with a writer in other group, BW is
> all over the place. Some of these might be dm-ioband bugs and some of
> these might be coming from the fact that buffering takes place in higher
> layer and dispatch is FIFO?

Thank you for testing. I did the same test and here are the results.

with vanilla CFQ
   <------------prio4 readers------------------>   prio0       group2
      maxbw       minbw      aggrbw     maxlat     aggrbw      bufwrite
 1 12,140KiB/s 12,140KiB/s 12,140KiB/s 30001msec 11,125KiB/s  1,923KiB/s
 2  3,967KiB/s  3,930KiB/s  7,897KiB/s 30001msec 14,213KiB/s  1,586KiB/s
 4  3,399KiB/s  3,066KiB/s 13,031KiB/s 30082msec  8,930KiB/s  1,296KiB/s
 8  2,086KiB/s  1,720KiB/s 15,266KiB/s 30003msec  7,546KiB/s    517KiB/s
16  1,156KiB/s    837KiB/s 15,377KiB/s 30033msec  4,282KiB/s    600KiB/s

with dm-ioband weight-iosize policy
   <------------prio4 readers------------------>   prio0       group2
      maxbw       minbw      aggrbw     maxlat     aggrbw      bufwrite
 1    107KiB/s    107KiB/s    107KiB/s 30007msec 12,242KiB/s 12,320KiB/s
 2  1,259KiB/s    702KiB/s  1,961KiB/s 30037msec  9,657KiB/s 11,657KiB/s
 4  2,705KiB/s     29KiB/s  5,186KiB/s 30026msec  5,927KiB/s 11,300KiB/s
 8  2,428KiB/s     27KiB/s  5,629KiB/s 30054msec  5,057KiB/s 10,704KiB/s
16  2,465KiB/s     23KiB/s  4,309KiB/s 30032msec  4,750KiB/s  9,088KiB/s

The results are somewhat different from yours. The bandwidth is
distributed to each group equally, but CFQ priority is broken as you
said. I think that the reason is not because of FIFO, but because
some IO requests are issued from dm-ioband's kernel thread on behalf of
processes which origirante the IO requests, then CFQ assumes that the
kernel thread is the originator and uses its io_context.

> > Here is my test script.
> > -------------------------------------------------------------------------
> > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> >      --group_reporting"
> > 
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> > 
> > echo $$ > /cgroup/1/tasks
> > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> > echo $$ > /cgroup/2/tasks
> > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> > echo $$ > /cgroup/tasks
> > wait
> > -------------------------------------------------------------------------
> > 
> > Be that as it way, I think that if every bio can point the iocontext
> > of the process, then it makes it possible to handle IO priority in the
> > higher level controller. A patchse has already posted by Takhashi-san.
> > What do you think about this idea?
> > 
> >   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> >   Subject [RFC][PATCH 1/10] I/O context inheritance
> >   From Hirokazu Takahashi <>
> >   http://lkml.org/lkml/2008/4/22/195
> 
> So far you have been denying that there are issues with ioprio with-in
> group in higher level controller. Here you seems to be saying that there are
> issues with ioprio and we need to take this patch in to solve the issue? I am
> confused?

The true intention of this patch is to preserve the io-context of a
process which originate it, but I think that we could also make use of
this patch for one of the way to solve this issue.

> Anyway, if you think that above patch is needed to solve the issue of
> ioprio in higher level controller, why are you not posting it as part of
> your patch series regularly, so that we can also apply this patch along
> with other patches and test the effects?

I will post the patch, but I would like to find out and understand the
reason for the above test results before posting it.

> Against what kernel version do the above patches apply? The biocgroup
> patches I tried against 2.6.31 as well as 2.6.32-rc1, and they do not
> apply cleanly against either.
> 
> So for the time being I am doing testing without biocgroup patches.

I created those patches against 2.6.32-rc1 and made sure the patches
can be cleanly applied to that version.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                 ` <20091006.161744.189719641.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-10-06 11:22                                   ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-06 11:22 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, Oct 06, 2009 at 04:17:44PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and Nauman,
> 
> Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> > >> > > How about adding a callback function to the higher level controller?
> > >> > > CFQ calls it when the active queue runs out of time, then the higher
> > >> > > level controller uses it as a trigger or a hint to move the IO group, so
> > >> > > I think a time-based controller could be implemented at a higher level.
> > >> > >
> > >> >
> > >> > Adding a callback should not be a big issue. But that means you are
> > >> > planning to run only one group at the higher layer at a time, and I
> > >> > think that's the problem, because then we are introducing serialization
> > >> > at the higher layer. So for any higher level device mapper target which
> > >> > has multiple physical disks under it, we might be underutilizing them
> > >> > even more and take a big hit on overall throughput.
> > >> >
> > >> > The whole design of doing proportional weight at the lower layer is
> > >> > optimal usage of the system.
> > >>
> > >> But I think that the higher level approach makes it easy to configure
> > >> against striped software raid devices.
> > >
> > > How does it become easier to configure in the case of a higher level
> > > controller?
> > >
> > > In the case of the lower level design, one just has to create cgroups and
> > > assign weights to them. This minimum step will be required in the higher
> > > level controller also. (Even if you get rid of the dm-ioband device setup
> > > step.)
> 
> In the case of lower level controller, if we need to assign weights on
> a per device basis, we have to assign weights to all devices of which
> a raid device consists, but in the case of higher level controller, 
> we just assign weights to the raid device only.
> 

This is required only if you need to assign different weights to different
devices. This is just additional facility and not a requirement. Normally
you will not be required to do that and devices will inherit the cgroup
weights automatically. So one has to only assign the cgroup weights.

> > >> If one would like to
> > >> combine some physical disks into one logical device like a dm-linear,
> > >> I think one should map the IO controller on each physical device and
> > >> combine them into one logical device.
> > >>
> > >
> > > In fact this sounds like a more complicated step where one has to set up
> > > one dm-ioband device on top of each physical device. But I am assuming
> > > that this will go away once you move to a per-request-queue implementation.
> 
> I don't understand why the per request queue implementation makes it
> go away. If dm-ioband is integrated into the LVM tools, it could allow
> users to skip the complicated steps to configure dm-linear devices.
> 

Those who are not using dm-tools will be forced to use dm-tools for
bandwidth control features.

> > > I think it should be the same in principle as my initial implementation
> > > of an IO controller on the request queue, and I stopped development on it
> > > because of FIFO dispatch.
> 
> I think that FIFO dispatch seldom leads to priority inversion, because
> the holding period for throttling is not long enough to break IO priority.
> I did some tests to see whether priority inversion happened.
> 
> The first test ran fio sequential readers on the same group. The BE0
> reader got the highest throughput as I expected.
> 
> nr_threads      16      |      16    |     1
> ionice          BE7     |     BE7    |    BE0
> ------------------------+------------+-------------
> vanilla     10,076KiB/s | 9,779KiB/s | 32,775KiB/s
> ioband       9,576KiB/s | 9,367KiB/s | 34,154KiB/s
> 
> The second test ran fio sequential readers on two different groups and
> gave weights of 20 and 10 to the groups respectively. The bandwidth
> was distributed according to their weights, and the BE0 reader got
> higher throughput than the BE7 readers in the same group. IO priority
> was preserved within the IO group.
> 
> group         group1    |         group2
> weight          20      |           10    
> ------------------------+--------------------------
> nr_threads      16      |      16    |     1
> ionice          BE7     |     BE7    |    BE0
> ------------------------+--------------------------
> ioband      27,513KiB/s | 3,524KiB/s | 10,248KiB/s
>                         |     Total = 13,772KiB/s
> 

Interesting. In all the test cases you always test with sequential
readers. I have changed the test case a bit (I have already reported the
results in another mail; now I am running the same test again with dm-ioband
version 1.14). I made all the readers do direct IO, and in the other group
I put a buffered writer. So the setup looks as follows.

In group1, I launch 1 prio 0 reader and an increasing number of prio 4
readers. In group2 I just run a dd doing buffered writes. The weights of
both groups are 100 each.
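For reference, a dry-run sketch that generates the commands for this setup (the /cgroup and /mnt1 paths, the BE-class ionice levels, and the dd target file are assumptions modeled on the scripts quoted elsewhere in this thread, not the exact invocation used here):

```python
def reader_cmds(n_prio4):
    """Build the shell commands for 1 prio0 direct reader plus n prio4
    readers in group1, and a buffered dd writer in group2 (dry run only)."""
    base = ("ionice -c 2 -n {prio} fio --rw=read --direct=1 --runtime=30 "
            "--directory=/mnt1 --size=1024M --name={name} --numjobs={jobs}")
    cmds = [
        # Put the shell into group1, then launch the readers from it.
        "echo $$ > /cgroup/group1/tasks",
        base.format(prio=0, name="prio0-reader", jobs=1) + " &",
    ]
    if n_prio4:
        cmds.append(
            base.format(prio=4, name="prio4-readers", jobs=n_prio4) + " &")
    cmds += [
        # Move to group2 for the competing buffered writer.
        "echo $$ > /cgroup/group2/tasks",
        "dd if=/dev/zero of=/mnt1/zerofile bs=1M count=4096 &",
        "wait",
    ]
    return cmds


for line in reader_cmds(4):
    print(line)
```

Running it with n_prio4 of 1, 2, 4, 8 and 16 reproduces the rows of the tables below.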

Following are the results on 2.6.31 kernel.

With-dm-ioband
==============
<------------prio4 readers---------------------->  <---prio0 reader------>
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   9992KiB/s   9992KiB/s   9992KiB/s   413K usec   4621KiB/s   369K usec   
2   4859KiB/s   4265KiB/s   9122KiB/s   344K usec   4915KiB/s   401K usec   
4   2238KiB/s   1381KiB/s   7703KiB/s   532K usec   3195KiB/s   546K usec   
8   504KiB/s    46KiB/s     1439KiB/s   399K usec   7661KiB/s   220K usec   
16  131KiB/s    26KiB/s     638KiB/s    492K usec   4847KiB/s   359K usec   

With vanilla CFQ
================
<------------prio4 readers---------------------->  <---prio0 reader------>
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   10779KiB/s  10779KiB/s  10779KiB/s  407K usec   16094KiB/s  808K usec   
2   7045KiB/s   6913KiB/s   13959KiB/s  538K usec   18794KiB/s  761K usec   
4   7842KiB/s   4409KiB/s   20967KiB/s  876K usec   12543KiB/s  443K usec   
8   6198KiB/s   2426KiB/s   24219KiB/s  1469K usec  9483KiB/s   685K usec   
16  5041KiB/s   1358KiB/s   27022KiB/s  2417K usec  6211KiB/s   1025K usec  


The above results show how bandwidth got distributed between the prio4
readers and the prio0 reader within the group as we increased the number
of prio4 readers. In the other group a buffered writer is continuously
running as a competitor.

Notice how, with dm-ioband, bandwidth allocation is broken.

With 1 prio4 reader, the prio4 reader got more bandwidth than the prio0 reader.

With 2 prio4 readers, it looks like prio4 got almost the same BW as prio0.

With 8 and 16 prio4 readers, it looks like the prio0 reader takes over and
the prio4 readers starve.

As we increase the number of prio4 readers in the group, their total aggregate
BW share should increase. Instead it is decreasing.
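To make that expectation concrete: under any proportional-share model where each reader's service is weighted by its priority, the aggregate share of the prio4 readers grows with their count. The weights below are made-up illustrative numbers, not CFQ's actual prio-to-slice mapping:

```python
# Hypothetical per-priority weights: higher priority -> larger share.
WEIGHT = {0: 8, 4: 4}


def prio4_aggregate_share(n_prio4):
    """Fraction of total service the prio4 readers should get when
    competing with one prio0 reader, under simple weighted sharing."""
    total = WEIGHT[0] * 1 + WEIGHT[4] * n_prio4
    return WEIGHT[4] * n_prio4 / total


shares = [prio4_aggregate_share(n) for n in (1, 2, 4, 8, 16)]
print([round(s, 2) for s in shares])  # -> [0.33, 0.5, 0.67, 0.8, 0.89]
```

Whatever the exact weights, the share is monotonically increasing in the number of prio4 readers, which is the opposite of the trend in the dm-ioband table above.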

So to me, in the face of competition with a writer in the other group, BW is
all over the place. Some of this might be dm-ioband bugs, and some of it
might be coming from the fact that buffering takes place in the higher
layer and dispatch is FIFO?

> Here is my test script.
> -------------------------------------------------------------------------
> arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
>      --group_reporting"
> 
> sync
> echo 3 > /proc/sys/vm/drop_caches
> 
> echo $$ > /cgroup/1/tasks
> ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> echo $$ > /cgroup/2/tasks
> ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> echo $$ > /cgroup/tasks
> wait
> -------------------------------------------------------------------------
> 
> Be that as it may, I think that if every bio can point to the io_context
> of the process, then it makes it possible to handle IO priority in the
> higher level controller. A patchset has already been posted by Takahashi-san.
> What do you think about this idea?
> 
>   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
>   Subject [RFC][PATCH 1/10] I/O context inheritance
>   From Hirokazu Takahashi <>
>   http://lkml.org/lkml/2008/4/22/195

So far you have been denying that there are issues with ioprio within a
group in the higher level controller. Here you seem to be saying that there
are issues with ioprio and that we need this patch to solve them? I am
confused.

Anyway, if you think that the above patch is needed to solve the issue of
ioprio in the higher level controller, why are you not posting it as part of
your patch series regularly, so that we can also apply it along with the
other patches and test the effects?

> 
> > > So you seem to be suggesting that you will move dm-ioband to the request
> > > queue so that the additional device setup is gone. You will also enable
> > > it to do a time-based group policy, so that we don't run into issues on
> > > seeky media. You will also enable dispatch from only one group at a time
> > > so that we don't run into isolation issues and can do time accounting
> > > accurately.
> > 
> > Will that approach solve the problem of doing bandwidth control on
> > logical devices? What would be the advantages compared to Vivek's
> > current patches?
> 
> I will only move the point where dm-ioband grabs bios; the rest of
> dm-ioband's mechanism and functionality will still be the same.
> The advantages over scheduler based controllers are:
>  - it can work with any type of block device
>  - it can work with any type of IO scheduler and needs no big change.
> 

Whether a big change is needed, we will come to know for sure when we have
the implementation for the timed groups done and shown that it works as well
as my patches. There are so many subtle things with the time based approach.

[..]
> > >> > Is there a new version of dm-ioband now where you have solved the issue of
> > >> > sync/async dispatch within a group? Before meeting at the mini-summit, I am
> > >> > trying to run some tests and come up with numbers so that we have a
> > >> > clearer picture of the pros/cons.
> > >>
> > >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> > >> dm-ioband handles sync/async IO requests separately and
> > >> the write-starve-read issue you pointed out is fixed. I would
> > >> appreciate it if you would try them.
> > >> http://sourceforge.net/projects/ioband/files/
> > >
> > > Cool. Will get to testing it.
> 
> Thanks for your help in advance.

Against what kernel version do the above patches apply? The biocgroup patches
I tried against 2.6.31 as well as 2.6.32-rc1, and they do not apply cleanly
against either.

So for the time being I am doing testing without biocgroup patches.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-06  7:17                                 ` Ryo Tsuruta
@ 2009-10-06 11:22                                   ` Vivek Goyal
  -1 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-06 11:22 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: nauman, m-ikeda, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, agk, akpm, peterz, jmarchan, torvalds, mingo, riel,
	yoshikawa.takuya

On Tue, Oct 06, 2009 at 04:17:44PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and Nauman,
> 
> Nauman Rafique <nauman@google.com> wrote:
> > >> > > How about adding a callback function to the higher level controller?
> > >> > > CFQ calls it when the active queue runs out of time, then the higer
> > >> > > level controller use it as a trigger or a hint to move IO group, so
> > >> > > I think a time-based controller could be implemented at higher level.
> > >> > >
> > >> >
> > >> > Adding a call back should not be a big issue. But that means you are
> > >> > planning to run only one group at higher layer at one time and I think
> > >> > that's the problem because than we are introducing serialization at higher
> > >> > layer. So any higher level device mapper target which has multiple
> > >> > physical disks under it, we might be underutilizing these even more and
> > >> > take a big hit on overall throughput.
> > >> >
> > >> > The whole design of doing proportional weight at lower layer is optimial
> > >> > usage of system.
> > >>
> > >> But I think that the higher level approch makes easy to configure
> > >> against striped software raid devices.
> > >
> > > How does it make easier to configure in case of higher level controller?
> > >
> > > In case of lower level design, one just have to create cgroups and assign
> > > weights to cgroups. This mininum step will be required in higher level
> > > controller also. (Even if you get rid of dm-ioband device setup step).
> 
> In the case of lower level controller, if we need to assign weights on
> a per device basis, we have to assign weights to all devices of which
> a raid device consists, but in the case of higher level controller, 
> we just assign weights to the raid device only.
> 

This is required only if you need to assign different weights to different
devices. This is just additional facility and not a requirement. Normally
you will not be required to do that and devices will inherit the cgroup
weights automatically. So one has to only assign the cgroup weights.

> > >> If one would like to
> > >> combine some physical disks into one logical device like a dm-linear,
> > >> I think one should map the IO controller on each physical device and
> > >> combine them into one logical device.
> > >>
> > >
> > > In fact this sounds like a more complicated step where one has to setup
> > > one dm-ioband device on top of each physical device. But I am assuming
> > > that this will go away once you move to per reuqest queue like implementation.
> 
> I don't understand why the per request queue implementation makes it
> go away. If dm-ioband is integrated into the LVM tools, it could allow
> users to skip the complicated steps to configure dm-linear devices.
> 

Those who are not using dm-tools will be forced to use dm-tools for
bandwidth control features.

> > > I think it should be same in principal as my initial implementation of IO
> > > controller on request queue and I stopped development on it because of FIFO
> > > dispatch.
> 
> I think that FIFO dispatch seldom lead to prioviry inversion, because
> holding period for throttling is not too long to break the IO priority.
> I did some tests to see whether priority inversion is happened.
> 
> The first test ran fio sequential readers on the same group. The BE0
> reader got the highest throughput as I expected.
> 
> nr_threads      16      |      16    |     1
> ionice          BE7     |     BE7    |    BE0
> ------------------------+------------+-------------
> vanilla     10,076KiB/s | 9,779KiB/s | 32,775KiB/s
> ioband       9,576KiB/s | 9,367KiB/s | 34,154KiB/s
> 
> The second test ran fio sequential readers on two different groups and
> give weights of 20 and 10 to each group respectively. The bandwidth
> was distributed according to their weights and the BE0 reader got
> higher throughput than the BE7 readers in the same group. IO priority
> was preserved within the IO group.
> 
> group         group1    |         group2
> weight          20      |           10    
> ------------------------+--------------------------
> nr_threads      16      |      16    |     1
> ionice          BE7     |     BE7    |    BE0
> ------------------------+--------------------------
> ioband      27,513KiB/s | 3,524KiB/s | 10,248KiB/s
>                         |     Total = 13,772KiB/s
> 

Interesting. In all the test cases you always test with sequential
readers. I have changed the test case a bit (I have already reported the
results in another mail, now running the same test again with dm-version
1.14). I made all the readers doing direct IO and in other group I put
a buffered writer. So setup looks as follows.

In group1, I launch 1 prio 0 reader and increasing number of prio4
readers. In group 2 I just run a dd doing buffered writes. Weights of
both the groups are 100 each.

Following are the results on 2.6.31 kernel.

With-dm-ioband
==============
<------------prio4 readers---------------------->  <---prio0 reader------>
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   9992KiB/s   9992KiB/s   9992KiB/s   413K usec   4621KiB/s   369K usec   
2   4859KiB/s   4265KiB/s   9122KiB/s   344K usec   4915KiB/s   401K usec   
4   2238KiB/s   1381KiB/s   7703KiB/s   532K usec   3195KiB/s   546K usec   
8   504KiB/s    46KiB/s     1439KiB/s   399K usec   7661KiB/s   220K usec   
16  131KiB/s    26KiB/s     638KiB/s    492K usec   4847KiB/s   359K usec   

With vanilla CFQ
================
<------------prio4 readers---------------------->  <---prio0 reader------>
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   10779KiB/s  10779KiB/s  10779KiB/s  407K usec   16094KiB/s  808K usec   
2   7045KiB/s   6913KiB/s   13959KiB/s  538K usec   18794KiB/s  761K usec   
4   7842KiB/s   4409KiB/s   20967KiB/s  876K usec   12543KiB/s  443K usec   
8   6198KiB/s   2426KiB/s   24219KiB/s  1469K usec  9483KiB/s   685K usec   
16  5041KiB/s   1358KiB/s   27022KiB/s  2417K usec  6211KiB/s   1025K usec  


Above results are showing how bandwidth got distributed between prio4 and
prio1 readers with-in group as we increased number of prio4 readers in
the group. In another group a buffered writer is continuously going on
as competitor.

Notice, with dm-ioband how bandwidth allocation is broken.

With 1 prio4 reader, prio4 reader got more bandwidth than prio1 reader.

With 2 prio4 readers, looks like prio4 got almost same BW as prio1.

With 8 and 16 prio4 readers, looks like prio0 readers takes over and prio4
readers starve.

As we incresae number of prio4 readers in the group, their total aggregate
BW share should increase. Instread it is decreasing.

So to me in the face of competition with a writer in other group, BW is
all over the place. Some of these might be dm-ioband bugs and some of
these might be coming from the fact that buffering takes place in higher
layer and dispatch is FIFO?

> Here is my test script.
> -------------------------------------------------------------------------
> arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
>      --group_reporting"
> 
> sync
> echo 3 > /proc/sys/vm/drop_caches
> 
> echo $$ > /cgroup/1/tasks
> ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> echo $$ > /cgroup/2/tasks
> ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> echo $$ > /cgroup/tasks
> wait
> -------------------------------------------------------------------------
> 
> Be that as it way, I think that if every bio can point the iocontext
> of the process, then it makes it possible to handle IO priority in the
> higher level controller. A patchse has already posted by Takhashi-san.
> What do you think about this idea?
> 
>   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
>   Subject [RFC][PATCH 1/10] I/O context inheritance
>   From Hirokazu Takahashi <>
>   http://lkml.org/lkml/2008/4/22/195

So far you have been denying that there are issues with ioprio with-in
group in higher level controller. Here you seems to be saying that there are
issues with ioprio and we need to take this patch in to solve the issue? I am
confused?

Anyway, if you think that above patch is needed to solve the issue of
ioprio in higher level controller, why are you not posting it as part of
your patch series regularly, so that we can also apply this patch along
with other patches and test the effects?

> 
> > > So you seem to be suggesting that you will move dm-ioband to request queue
> > > so that setting up additional device setup is gone. You will also enable
> > > it to do time based groups policy, so that we don't run into issues on
> > > seeky media. Will also enable dispatch from one group only at a time so
> > > that we don't run into isolation issues and can do time accounting
> > > accruately.
> > 
> > Will that approach solve the problem of doing bandwidth control on
> > logical devices? What would be the advantages compared to Vivek's
> > current patches?
> 
> I will only move the point where dm-ioband grabs bios, other
> dm-ioband's mechanism and functionality will stll be the same.
> The advantages against to scheduler based controllers are:
>  - can work with any type of block devices
>  - can work with any type of IO scheduler and no need a big change.
> 

The big change thing we will come to know for sure when we have
implementation for the timed groups done and shown that it works as well as my
patches. There are so many subtle things with time based approach. 

[..]
> > >> > Is there a new version of dm-ioband now where you have solved the issue of
> > >> > sync/async dispatch with-in group? Before meeting at mini-summit, I am
> > >> > trying to run some tests and come up with numbers so that we have more
> > >> > clear picture of pros/cons.
> > >>
> > >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> > >> dm-ioband handles sync/async IO requests separately and
> > >> the write-starve-read issue you pointed out is fixed. I would
> > >> appreciate it if you would try them.
> > >> http://sourceforge.net/projects/ioband/files/
> > >
> > > Cool. Will get to testing it.
> 
> Thanks for your help in advance.

Against what kernel version above patches apply. The biocgroup patches
I tried against 2.6.31 as well as 2.6.32-rc1 and it does not apply cleanly
against any of these?

So for the time being I am doing testing with biocgroup patches.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-10-06 11:22                                   ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-06 11:22 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew,
	yoshikawa.takuya, jmoyer, nauman, mingo, righi.andrea, riel,
	lizf, fchecconi, s-uchida, containers, linux-kernel, akpm,
	m-ikeda, torvalds

On Tue, Oct 06, 2009 at 04:17:44PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and Nauman,
> 
> Nauman Rafique <nauman@google.com> wrote:
> > >> > > How about adding a callback function to the higher level controller?
> > >> > > CFQ calls it when the active queue runs out of time, then the higer
> > >> > > level controller use it as a trigger or a hint to move IO group, so
> > >> > > I think a time-based controller could be implemented at higher level.
> > >> > >
> > >> >
> > >> > Adding a call back should not be a big issue. But that means you are
> > >> > planning to run only one group at higher layer at one time and I think
> > >> > that's the problem because than we are introducing serialization at higher
> > >> > layer. So any higher level device mapper target which has multiple
> > >> > physical disks under it, we might be underutilizing these even more and
> > >> > take a big hit on overall throughput.
> > >> >
> > >> > The whole design of doing proportional weight at lower layer is optimial
> > >> > usage of system.
> > >>
> > >> But I think that the higher level approch makes easy to configure
> > >> against striped software raid devices.
> > >
> > > How does it make easier to configure in case of higher level controller?
> > >
> > > In case of lower level design, one just have to create cgroups and assign
> > > weights to cgroups. This mininum step will be required in higher level
> > > controller also. (Even if you get rid of dm-ioband device setup step).
> 
> In the case of lower level controller, if we need to assign weights on
> a per device basis, we have to assign weights to all devices of which
> a raid device consists, but in the case of higher level controller, 
> we just assign weights to the raid device only.
> 

This is required only if you need to assign different weights to different
devices. This is just additional facility and not a requirement. Normally
you will not be required to do that and devices will inherit the cgroup
weights automatically. So one has to only assign the cgroup weights.

> > >> If one would like to
> > >> combine some physical disks into one logical device like a dm-linear,
> > >> I think one should map the IO controller on each physical device and
> > >> combine them into one logical device.
> > >>
> > >
> > > In fact this sounds like a more complicated step where one has to setup
> > > one dm-ioband device on top of each physical device. But I am assuming
> > > that this will go away once you move to per reuqest queue like implementation.
> 
> I don't understand why the per request queue implementation makes it
> go away. If dm-ioband is integrated into the LVM tools, it could allow
> users to skip the complicated steps to configure dm-linear devices.
> 

Those who are not using dm-tools will be forced to use dm-tools for
bandwidth control features.

> > > I think it should be same in principal as my initial implementation of IO
> > > controller on request queue and I stopped development on it because of FIFO
> > > dispatch.
> 
> I think that FIFO dispatch seldom lead to prioviry inversion, because
> holding period for throttling is not too long to break the IO priority.
> I did some tests to see whether priority inversion is happened.
> 
> The first test ran fio sequential readers on the same group. The BE0
> reader got the highest throughput as I expected.
> 
> nr_threads      16      |      16    |     1
> ionice          BE7     |     BE7    |    BE0
> ------------------------+------------+-------------
> vanilla     10,076KiB/s | 9,779KiB/s | 32,775KiB/s
> ioband       9,576KiB/s | 9,367KiB/s | 34,154KiB/s
> 
> The second test ran fio sequential readers on two different groups and
> give weights of 20 and 10 to each group respectively. The bandwidth
> was distributed according to their weights and the BE0 reader got
> higher throughput than the BE7 readers in the same group. IO priority
> was preserved within the IO group.
> 
> group         group1    |         group2
> weight          20      |           10    
> ------------------------+--------------------------
> nr_threads      16      |      16    |     1
> ionice          BE7     |     BE7    |    BE0
> ------------------------+--------------------------
> ioband      27,513KiB/s | 3,524KiB/s | 10,248KiB/s
>                         |     Total = 13,772KiB/s
> 

Interesting. In all the test cases you always test with sequential
readers. I have changed the test case a bit (I have already reported the
results in another mail, now running the same test again with dm-version
1.14). I made all the readers doing direct IO and in other group I put
a buffered writer. So setup looks as follows.

In group1, I launch 1 prio 0 reader and increasing number of prio4
readers. In group 2 I just run a dd doing buffered writes. Weights of
both the groups are 100 each.

Following are the results on 2.6.31 kernel.

With-dm-ioband
==============
<------------prio4 readers---------------------->  <---prio0 reader------>
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   9992KiB/s   9992KiB/s   9992KiB/s   413K usec   4621KiB/s   369K usec   
2   4859KiB/s   4265KiB/s   9122KiB/s   344K usec   4915KiB/s   401K usec   
4   2238KiB/s   1381KiB/s   7703KiB/s   532K usec   3195KiB/s   546K usec   
8   504KiB/s    46KiB/s     1439KiB/s   399K usec   7661KiB/s   220K usec   
16  131KiB/s    26KiB/s     638KiB/s    492K usec   4847KiB/s   359K usec   

With vanilla CFQ
================
<------------prio4 readers---------------------->  <---prio0 reader------>
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   10779KiB/s  10779KiB/s  10779KiB/s  407K usec   16094KiB/s  808K usec   
2   7045KiB/s   6913KiB/s   13959KiB/s  538K usec   18794KiB/s  761K usec   
4   7842KiB/s   4409KiB/s   20967KiB/s  876K usec   12543KiB/s  443K usec   
8   6198KiB/s   2426KiB/s   24219KiB/s  1469K usec  9483KiB/s   685K usec   
16  5041KiB/s   1358KiB/s   27022KiB/s  2417K usec  6211KiB/s   1025K usec  


Above results are showing how bandwidth got distributed between prio4 and
prio1 readers with-in group as we increased number of prio4 readers in
the group. In another group a buffered writer is continuously going on
as competitor.

Notice, with dm-ioband how bandwidth allocation is broken.

With 1 prio4 reader, prio4 reader got more bandwidth than prio1 reader.

With 2 prio4 readers, looks like prio4 got almost same BW as prio1.

With 8 and 16 prio4 readers, looks like prio0 readers takes over and prio4
readers starve.

As we incresae number of prio4 readers in the group, their total aggregate
BW share should increase. Instread it is decreasing.

So to me in the face of competition with a writer in other group, BW is
all over the place. Some of these might be dm-ioband bugs and some of
these might be coming from the fact that buffering takes place in higher
layer and dispatch is FIFO?

> Here is my test script.
> -------------------------------------------------------------------------
> arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
>      --group_reporting"
> 
> sync
> echo 3 > /proc/sys/vm/drop_caches
> 
> echo $$ > /cgroup/1/tasks
> ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> echo $$ > /cgroup/2/tasks
> ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> echo $$ > /cgroup/tasks
> wait
> -------------------------------------------------------------------------
> 
> Be that as it way, I think that if every bio can point the iocontext
> of the process, then it makes it possible to handle IO priority in the
> higher level controller. A patchse has already posted by Takhashi-san.
> What do you think about this idea?
> 
>   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
>   Subject [RFC][PATCH 1/10] I/O context inheritance
>   From Hirokazu Takahashi <>
>   http://lkml.org/lkml/2008/4/22/195

So far you have been denying that there are issues with ioprio within a
group in the higher level controller. Here you seem to be saying that there
are issues with ioprio and that we need to take this patch in to solve
them? I am confused.

Anyway, if you think that the above patch is needed to solve the issue of
ioprio in the higher level controller, why are you not posting it regularly
as part of your patch series, so that we can apply it along with the other
patches and test the effects?

> 
> > > So you seem to be suggesting that you will move dm-ioband to the request
> > > queue so that the additional device setup step is gone. You will also
> > > enable it to do time based group policy, so that we don't run into issues
> > > on seeky media. You will also enable dispatch from only one group at a
> > > time so that we don't run into isolation issues and can do time
> > > accounting accurately.
> > 
> > Will that approach solve the problem of doing bandwidth control on
> > logical devices? What would be the advantages compared to Vivek's
> > current patches?
> 
> I will only move the point where dm-ioband grabs bios; the rest of
> dm-ioband's mechanism and functionality will still be the same.
> The advantages over scheduler based controllers are:
>  - it can work with any type of block device
>  - it can work with any type of IO scheduler, and no big change is needed.
> 

Whether a big change is needed we will know for sure only when the
implementation of timed groups is done and shown to work as well as my
patches. There are so many subtle things with the time based approach.

[..]
> > >> > Is there a new version of dm-ioband now where you have solved the issue
> > >> > of sync/async dispatch within a group? Before meeting at the mini-summit,
> > >> > I am trying to run some tests and come up with numbers so that we have a
> > >> > clearer picture of the pros/cons.
> > >>
> > >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> > >> dm-ioband handles sync/async IO requests separately and
> > >> the write-starve-read issue you pointed out is fixed. I would
> > >> appreciate it if you would try them.
> > >> http://sourceforge.net/projects/ioband/files/
> > >
> > > Cool. Will get to testing it.
> 
> Thanks for your help in advance.

Against what kernel version do the above patches apply? I tried the
biocgroup patches against both 2.6.31 and 2.6.32-rc1, and they do not apply
cleanly against either.

So for the time being I am doing testing without the biocgroup patches.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                               ` <e98e18940910051111r110dc776l5105bf931761b842-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-06  7:17                                 ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-06  7:17 UTC (permalink / raw)
  To: nauman-hpIqsD4AKlfQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek and Nauman,

Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote:
> >> > > How about adding a callback function to the higher level controller?
> >> > > CFQ calls it when the active queue runs out of time, then the higher
> >> > > level controller uses it as a trigger or a hint to move the IO group;
> >> > > I think a time-based controller could be implemented at a higher level.
> >> > >
> >> >
> >> > Adding a callback should not be a big issue. But that means you are
> >> > planning to run only one group at the higher layer at a time, and I
> >> > think that's the problem, because then we are introducing serialization
> >> > at the higher layer. So for any higher level device mapper target which
> >> > has multiple physical disks under it, we might be underutilizing them
> >> > even more and take a big hit on overall throughput.
> >> >
> >> > The whole design of doing proportional weight at the lower layer is
> >> > optimal usage of the system.
> >>
> >> But I think that the higher level approach makes it easy to configure
> >> against striped software RAID devices.
> >
> > How does it become easier to configure in the case of a higher level
> > controller?
> >
> > In the case of the lower level design, one just has to create cgroups and
> > assign weights to them. This minimum step will be required with a higher
> > level controller also. (Even if you get rid of the dm-ioband device setup
> > step.)

In the case of a lower level controller, if we need to assign weights on
a per-device basis, we have to assign weights to all the devices of which
a RAID device consists, but in the case of a higher level controller,
we just assign a weight to the RAID device itself.

> >> If one would like to
> >> combine some physical disks into one logical device like a dm-linear,
> >> I think one should map the IO controller on each physical device and
> >> combine them into one logical device.
> >>
> >
> > In fact this sounds like a more complicated step, where one has to set up
> > one dm-ioband device on top of each physical device. But I am assuming
> > that this will go away once you move to a per-request-queue implementation.

I don't understand why the per-request-queue implementation makes it
go away. If dm-ioband is integrated into the LVM tools, it could allow
users to skip the complicated steps of configuring dm-linear devices.

> > I think it should be the same in principle as my initial implementation of
> > an IO controller on the request queue, and I stopped development on it
> > because of FIFO dispatch.

I think that FIFO dispatch seldom leads to priority inversion, because
the holding period for throttling is not long enough to break the IO
priority. I did some tests to see whether priority inversion happens.

The first test ran fio sequential readers in the same group. The BE0
reader got the highest throughput, as I expected.

nr_threads      16      |      16    |     1
ionice          BE7     |     BE7    |    BE0
------------------------+------------+-------------
vanilla     10,076KiB/s | 9,779KiB/s | 32,775KiB/s
ioband       9,576KiB/s | 9,367KiB/s | 34,154KiB/s

The second test ran fio sequential readers in two different groups and
gave weights of 20 and 10 to the groups respectively. The bandwidth
was distributed according to their weights, and the BE0 reader got
higher throughput than the BE7 readers in the same group. IO priority
was preserved within the IO group.

group         group1    |         group2
weight          20      |           10    
------------------------+--------------------------
nr_threads      16      |      16    |     1
ionice          BE7     |     BE7    |    BE0
------------------------+--------------------------
ioband      27,513KiB/s | 3,524KiB/s | 10,248KiB/s
                        |     Total = 13,772KiB/s
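
As a quick sanity check of the numbers above, the split matches the
configured 20:10 weights: group2's two readers sum to 13,772 KiB/s, and
group1's 27,513 KiB/s is almost exactly twice that:

```shell
#!/bin/sh
# Sanity-check the proportional split in the table above.
# group1 (weight 20): 27513 KiB/s; group2 (weight 10): 3524 + 10248 KiB/s.
g2_total=$((3524 + 10248))
echo "group2 total: ${g2_total} KiB/s"
# Measured bandwidth ratio vs. configured weight ratio (20:10 = 2.0).
awk -v g1=27513 -v g2="$g2_total" \
    'BEGIN { printf "measured ratio: %.2f (weight ratio 2.00)\n", g1 / g2 }'
```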

Here is my test script.
-------------------------------------------------------------------------
arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
     --group_reporting"

sync
echo 3 > /proc/sys/vm/drop_caches

echo $$ > /cgroup/1/tasks
ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
echo $$ > /cgroup/2/tasks
ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
echo $$ > /cgroup/tasks
wait
-------------------------------------------------------------------------
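
For completeness, the script assumes a cgroup hierarchy is already mounted at
/cgroup with groups 1 and 2 created and weights assigned. A minimal setup
might look like the following sketch; the cgroup subsystem name and the
weight-assignment interface depend on the blkio-cgroup/dm-ioband version, so
treat every name here as an assumption:

```shell
#!/bin/sh
# Hypothetical setup assumed by the fio script above; the subsystem name
# and the way weights are assigned vary between blkio-cgroup/dm-ioband
# releases, so this is a sketch, not a recipe.
mkdir -p /cgroup
mount -t cgroup -o blkio none /cgroup   # subsystem name is an assumption
mkdir -p /cgroup/1 /cgroup/2            # the two competing groups
# Weight assignment (20 for group1, 10 for group2) is controller specific,
# e.g. a per-group weight file or a dmsetup message; see the dm-ioband docs.
```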

Be that as it may, I think that if every bio can point to the iocontext
of the process, then it makes it possible to handle IO priority in the
higher level controller. A patchset has already been posted by
Takahashi-san. What do you think about this idea?

  Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
  Subject [RFC][PATCH 1/10] I/O context inheritance
  From Hirokazu Takahashi <>
  http://lkml.org/lkml/2008/4/22/195

> > So you seem to be suggesting that you will move dm-ioband to the request
> > queue so that the additional device setup step is gone. You will also
> > enable it to do time based group policy, so that we don't run into issues
> > on seeky media. You will also enable dispatch from only one group at a
> > time so that we don't run into isolation issues and can do time
> > accounting accurately.
> 
> Will that approach solve the problem of doing bandwidth control on
> logical devices? What would be the advantages compared to Vivek's
> current patches?

I will only move the point where dm-ioband grabs bios; the rest of
dm-ioband's mechanism and functionality will still be the same.
The advantages over scheduler based controllers are:
 - it can work with any type of block device
 - it can work with any type of IO scheduler, and no big change is needed.

> > If yes, then that has the potential to solve the issue. At higher layer one
> > can think of enabling size of IO/number of IO policy both for proportional
> > BW and max BW type of control. At lower level one can enable pure time
> > based control on seeky media.
> >
> > I think this will still leave the issue of prio within a group, as group
> > control is separate and you will not be maintaining separate queues for
> > each process. Similarly, you will also have issues with read vs write
> > ratios as the IO schedulers underneath change.
> >
> > So I will be curious to see that implementation.
> >
> >> > > My requirements for IO controller are:
> >> > > - Implemented as a higher level controller, which is located at the
> >> > >   block layer, and bio is grabbed in generic_make_request().
> >> >
> >> > How are you planning to handle the issue of buffered writes Andrew raised?
> >>
> >> I think that it would be better to use the higher-level controller
> >> along with the memory controller and have limits on memory usage for each
> >> cgroup. And as Kamezawa-san said, having limits on dirty pages would
> >> be better, too.
> >>
> >
> > Ok. So if we plan to co-mount the memory controller with per-memory-group
> > dirty_ratio implemented, that can work with both the higher level as well
> > as the low level controller. Not sure if we also require some kind of
> > per-memory-group flusher thread infrastructure to make sure the higher
> > weight group gets more work done.

I'm not sure either that a per-memory-group flusher is necessary.
And we have to consider not only pdflush but also other threads which
issue IOs from multiple groups.

> >> > > - Can work with any type of IO scheduler.
> >> > > - Can work with any type of block devices.
> >> > > - Support multiple policies: proportional weight, max rate, time
> >> > >   based, and so on.
> >> > >
> >> > > The IO controller mini-summit will be held next week, and I'm
> >> > > looking forward to meeting you all and discussing the IO controller.
> >> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> >> >
> >> > Is there a new version of dm-ioband now where you have solved the issue
> >> > of sync/async dispatch within a group? Before meeting at the mini-summit,
> >> > I am trying to run some tests and come up with numbers so that we have a
> >> > clearer picture of the pros/cons.
> >>
> >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> >> dm-ioband handles sync/async IO requests separately and
> >> the write-starve-read issue you pointed out is fixed. I would
> >> appreciate it if you would try them.
> >> http://sourceforge.net/projects/ioband/files/
> >
> > Cool. Will get to testing it.

Thanks for your help in advance.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                             ` <20091005171023.GG22143-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-10-05 18:11                               ` Nauman Rafique
  0 siblings, 0 replies; 349+ messages in thread
From: Nauman Rafique @ 2009-10-05 18:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Oct 5, 2009 at 10:10 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote:
>> Hi Vivek,
>>
>> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
>> > > Hi,
>> > >
>> > > Munehiro Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org> wrote:
>> > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
>> > > > > Before finishing this mail, will throw a whacky idea in the ring. I was
>> > > > > going through the request based dm-multipath paper. Will it make sense
>> > > > > to implement request based dm-ioband? So basically we implement all the
>> > > > > group scheduling in CFQ and let dm-ioband implement a request function
>> > > > > to take the request and break it back into bios. This way we can keep
>> > > > > all the group control at one place and also meet most of the requirements.
>> > > > >
>> > > > > So request based dm-ioband will have a request in hand once that request
>> > > > > has passed group control and prio control. Because dm-ioband is a device
>> > > > > mapper target, one can put it on higher level devices (practically taking
>> > > > > CFQ to the higher level device), and provide fairness there. One can also
>> > > > > put it on those SSDs which don't use an IO scheduler (this kind of forces
>> > > > > them to use one.)
>> > > > >
>> > > > > I am sure there will be many issues, but one big issue I can think
>> > > > > of is that CFQ assumes there is one device beneath it and dispatches
>> > > > > requests from one queue (in case of idling); that would kill
>> > > > > parallelism at the higher layer, and throughput will suffer on many
>> > > > > of the dm/md configurations.
>> > > > >
>> > > > > Thanks
>> > > > > Vivek
>> > > >
>> > > > As long as CFQ is used, your idea sounds reasonable to me.  But what
>> > > > about other IO schedulers?  In my understanding, one of the keys to
>> > > > guaranteeing group isolation in your patch is having a per-group IO
>> > > > scheduler internal queue even with the as, deadline, and noop
>> > > > schedulers.  I think this is a great idea, and implementing generic
>> > > > code for all IO schedulers was the conclusion reached when we had so
>> > > > many IO-scheduler-specific proposals.
>> > > > If we still need per-group IO scheduler internal queues with
>> > > > request-based dm-ioband, we have to modify the elevator layer.  That
>> > > > seems out of scope for dm.
>> > > > I might be missing something...
>> > >
>> > > IIUC, the request-based device-mapper cannot break a request back
>> > > into bios, so it cannot work with block devices which don't use an
>> > > IO scheduler.
>> > >
>> >
>> > I think the current request-based multipath driver does not do it, but
>> > couldn't it be implemented so that requests are broken back into bios?
>>
>> I guess it would be hard to implement; we would need to hold requests
>> and throttle them there, and that would break the ordering done by CFQ.
>>
>> > Anyway, I don't feel too strongly about this approach, as it might
>> > introduce more serialization at the higher layer.
>>
>> Yes, I know it.
>>
>> > > How about adding a callback function for the higher-level controller?
>> > > CFQ calls it when the active queue runs out of time; the higher-level
>> > > controller then uses it as a trigger or a hint to move to another IO
>> > > group. This way, a time-based controller could be implemented at a
>> > > higher level.
>> > >
>> >
>> > Adding a callback should not be a big issue. But that means you are
>> > planning to run only one group at the higher layer at a time, and I
>> > think that's the problem, because then we are introducing serialization
>> > at the higher layer. So for any higher-level device mapper target which
>> > has multiple physical disks under it, we might underutilize those disks
>> > even more and take a big hit on overall throughput.
>> >
>> > The whole point of doing proportional weight at the lower layer is
>> > optimal usage of the system.
>>
>> But I think that the higher-level approach makes it easy to configure
>> against striped software RAID devices.
>
> How does a higher-level controller make configuration easier?
>
> In case of the lower-level design, one just has to create cgroups and
> assign weights to them. This minimum step will also be required with a
> higher-level controller (even if you get rid of the dm-ioband device
> setup step).
>
>> If one would like to
>> combine some physical disks into one logical device like a dm-linear,
>> I think one should map the IO controller on each physical device and
>> combine them into one logical device.
>>
>
> In fact this sounds like a more complicated step, where one has to set up
> one dm-ioband device on top of each physical device. But I am assuming
> that this will go away once you move to a per-request-queue implementation.
>
> I think it should be the same in principle as my initial implementation
> of the IO controller on the request queue, which I stopped developing
> because of FIFO dispatch.
>
> So you seem to be suggesting that you will move dm-ioband to the request
> queue so that the additional device setup step is gone. You will also
> enable it to do a time-based group policy, so that we don't run into
> issues on seeky media, and enable dispatch from only one group at a time,
> so that we don't run into isolation issues and can do time accounting
> accurately.

Will that approach solve the problem of doing bandwidth control on
logical devices? What would be the advantages compared to Vivek's
current patches?

>
> If yes, then that has the potential to solve the issue. At the higher
> layer one can think of enabling size-of-IO/number-of-IO policies, both
> for proportional-BW and max-BW types of control. At the lower level one
> can enable pure time-based control on seeky media.
>
> I think this will still leave the issue of prio within a group, as group
> control is separate and you will not be maintaining separate queues for
> each process. Similarly, you will also have issues with read vs write
> ratios as the IO schedulers underneath change.
>
> So I will be curious to see that implementation.
>
>> > > My requirements for an IO controller are:
>> > > - Implemented as a higher-level controller, located at the block
>> > >   layer, where bios are grabbed in generic_make_request().
>> >
>> > How are you planning to handle the issue of buffered writes Andrew raised?
>>
>> I think that it would be better to use the higher-level controller
>> along with the memory controller and have limits on memory usage for
>> each cgroup. And as Kamezawa-san said, having limits on dirty pages
>> would be better, too.
>>
>
> Ok. So if we plan to co-mount the memory controller with per-memory-group
> dirty_ratio implemented, that can work with both the higher-level and the
> low-level controller. I am not sure whether we also require some kind of
> per-memory-group flusher thread infrastructure to make sure a
> higher-weight group gets more work done.
>
>> > > - Can work with any type of IO scheduler.
>> > > - Can work with any type of block device.
>> > > - Supports multiple policies: proportional weight, max rate,
>> > >   time-based, and so on.
>> > >
>> > > The IO controller mini-summit will be held next week, and I'm
>> > > looking forward to meeting you all and discussing the IO controller.
>> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
>> >
>> > Is there a new version of dm-ioband now where you have solved the issue
>> > of sync/async dispatch within a group? Before we meet at the mini-summit,
>> > I am trying to run some tests and come up with numbers so that we have a
>> > clearer picture of the pros/cons.
>>
>> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
>> dm-ioband handles sync/async IO requests separately, and the
>> write-starve-read issue you pointed out is fixed. I would appreciate it
>> if you would try them.
>> http://sourceforge.net/projects/ioband/files/
>
> Cool. Will get to testing it.
>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                           ` <20091005.235535.193690928.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-10-05 17:10                             ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-05 17:10 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> > > Hi,
> > > 
> > > Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
> > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > > > going through the request based dm-multipath paper. Will it make sense
> > > > > to implement request based dm-ioband? So basically we implement all the
> > > > > group scheduling in CFQ and let dm-ioband implement a request function
> > > > > to take the request and break it back into bios. This way we can keep
> > > > > all the group control at one place and also meet most of the requirements.
> > > > >
> > > > > So request based dm-ioband will have a request in hand once that request
> > > > > has passed group control and prio control. Because dm-ioband is a device
> > > > > mapper target, one can put it on higher level devices (practically taking
> > > > > CFQ at higher level device), and provide fairness there. One can also
> > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > > > > them to use the IO scheduler.)
> > > > >
> > > > > I am sure there will be many issues, but one big issue I can think
> > > > > of is that CFQ assumes there is one device beneath it and dispatches
> > > > > requests from one queue (in case of idling); that would kill
> > > > > parallelism at the higher layer, and throughput will suffer on many
> > > > > of the dm/md configurations.
> > > > >
> > > > > Thanks
> > > > > Vivek
> > > > 
> > > > As long as CFQ is used, your idea sounds reasonable to me.  But what
> > > > about other IO schedulers?  In my understanding, one of the keys to
> > > > guaranteeing group isolation in your patch is having a per-group IO
> > > > scheduler internal queue even with the as, deadline, and noop
> > > > schedulers.  I think this is a great idea, and implementing generic
> > > > code for all IO schedulers was the conclusion reached when we had so
> > > > many IO-scheduler-specific proposals.
> > > > If we still need per-group IO scheduler internal queues with
> > > > request-based dm-ioband, we have to modify the elevator layer.  That
> > > > seems out of scope for dm.
> > > > I might be missing something...
> > > 
> > > IIUC, the request-based device-mapper cannot break a request back
> > > into bios, so it cannot work with block devices which don't use an
> > > IO scheduler.
> > > 
> > 
> > I think the current request-based multipath driver does not do it, but
> > couldn't it be implemented so that requests are broken back into bios?
> 
> I guess it would be hard to implement; we would need to hold requests
> and throttle them there, and that would break the ordering done by CFQ.
> 
> > Anyway, I don't feel too strongly about this approach, as it might
> > introduce more serialization at the higher layer.
> 
> Yes, I know it.
> 
> > > How about adding a callback function for the higher-level controller?
> > > CFQ calls it when the active queue runs out of time; the higher-level
> > > controller then uses it as a trigger or a hint to move to another IO
> > > group. This way, a time-based controller could be implemented at a
> > > higher level.
> > > 
> > 
> > Adding a callback should not be a big issue. But that means you are
> > planning to run only one group at the higher layer at a time, and I
> > think that's the problem, because then we are introducing serialization
> > at the higher layer. So for any higher-level device mapper target which
> > has multiple physical disks under it, we might underutilize those disks
> > even more and take a big hit on overall throughput.
> >
> > The whole point of doing proportional weight at the lower layer is
> > optimal usage of the system.
> 
> But I think that the higher-level approach makes it easy to configure
> against striped software RAID devices.

How does a higher-level controller make configuration easier?

In case of the lower-level design, one just has to create cgroups and
assign weights to them. This minimum step will also be required with a
higher-level controller (even if you get rid of the dm-ioband device
setup step).

> If one would like to
> combine some physical disks into one logical device like a dm-linear,
> I think one should map the IO controller on each physical device and
> combine them into one logical device.
> 

In fact this sounds like a more complicated step, where one has to set up
one dm-ioband device on top of each physical device. But I am assuming
that this will go away once you move to a per-request-queue implementation.

I think it should be the same in principle as my initial implementation
of the IO controller on the request queue, which I stopped developing
because of FIFO dispatch.

So you seem to be suggesting that you will move dm-ioband to the request
queue so that the additional device setup step is gone. You will also
enable it to do a time-based group policy, so that we don't run into
issues on seeky media, and enable dispatch from only one group at a time,
so that we don't run into isolation issues and can do time accounting
accurately.

If yes, then that has the potential to solve the issue. At the higher
layer one can think of enabling size-of-IO/number-of-IO policies, both
for proportional-BW and max-BW types of control. At the lower level one
can enable pure time-based control on seeky media.

I think this will still leave the issue of prio within a group, as group
control is separate and you will not be maintaining separate queues for
each process. Similarly, you will also have issues with read vs write
ratios as the IO schedulers underneath change.

So I will be curious to see that implementation. 

> > > My requirements for an IO controller are:
> > > - Implemented as a higher-level controller, located at the block
> > >   layer, where bios are grabbed in generic_make_request().
> > 
> > How are you planning to handle the issue of buffered writes Andrew raised?
> 
> I think that it would be better to use the higher-level controller
> along with the memory controller and have limits on memory usage for
> each cgroup. And as Kamezawa-san said, having limits on dirty pages
> would be better, too.
> 

Ok. So if we plan to co-mount the memory controller with per-memory-group
dirty_ratio implemented, that can work with both the higher-level and the
low-level controller. I am not sure whether we also require some kind of
per-memory-group flusher thread infrastructure to make sure a
higher-weight group gets more work done.

> > > - Can work with any type of IO scheduler.
> > > - Can work with any type of block device.
> > > - Supports multiple policies: proportional weight, max rate,
> > >   time-based, and so on.
> > > 
> > > The IO controller mini-summit will be held next week, and I'm
> > > looking forward to meeting you all and discussing the IO controller.
> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> > 
> > Is there a new version of dm-ioband now where you have solved the issue
> > of sync/async dispatch within a group? Before we meet at the mini-summit,
> > I am trying to run some tests and come up with numbers so that we have a
> > clearer picture of the pros/cons.
> 
> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> dm-ioband handles sync/async IO requests separately and
> the write-starve-read issue you pointed out is fixed. I would
> appreciate it if you would try them.
> http://sourceforge.net/projects/ioband/files/ 

Cool. Will get to testing it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-05 14:55                           ` Ryo Tsuruta
@ 2009-10-05 17:10                             ` Vivek Goyal
  -1 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-05 17:10 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: m-ikeda, nauman, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, agk, akpm, peterz, jmarchan, torvalds, mingo, riel,
	yoshikawa.takuya

On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> > > Hi,
> > > 
> > > Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
> > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > > > going through the request based dm-multipath paper. Will it make sense
> > > > > to implement request based dm-ioband? So basically we implement all the
> > > > > group scheduling in CFQ and let dm-ioband implement a request function
> > > > > to take the request and break it back into bios. This way we can keep
> > > > > all the group control at one place and also meet most of the requirements.
> > > > >
> > > > > So request based dm-ioband will have a request in hand once that request
> > > > > has passed group control and prio control. Because dm-ioband is a device
> > > > > mapper target, one can put it on higher level devices (practically taking
> > > > > CFQ at higher level device), and provide fairness there. One can also
> > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > > > > them to use the IO scheduler.)
> > > > >
> > > > > I am sure there will be many issues, but one big issue I could think of is that
> > > > > CFQ thinks that there is one device beneath it and dispatches requests
> > > > > from one queue (in case of idling), and that would kill parallelism at the
> > > > > higher layer and throughput will suffer on many of the dm/md configurations.
> > > > >
> > > > > Thanks
> > > > > Vivek
> > > > 
> > > > As long as CFQ is used, your idea is reasonable to me.  But how about
> > > > other IO schedulers?  In my understanding, one of the keys to guaranteeing
> > > > group isolation in your patch is to have per-group IO scheduler internal
> > > > queues even with the as, deadline, and noop schedulers.  I think this is a
> > > > great idea, and implementing generic code for all IO schedulers was what we
> > > > concluded when we had so many IO scheduler specific proposals.
> > > > If we will still need per-group IO scheduler internal queues with
> > > > request-based dm-ioband, we have to modify elevator layer.  It seems
> > > > out of scope of dm.
> > > > I might miss something...
> > > 
> > > IIUC, the request based device-mapper could not break back a request
> > > into bio, so it could not work with block devices which don't use the
> > > IO scheduler.
> > > 
> > 
> > I think the current request based multipath driver does not do it, but can't it
> > be implemented so that requests are broken back into bios?
> 
> I guess it would be hard to implement, as we would need to hold requests
> and throttle them there, and it would break the ordering done by CFQ.
> 
> > Anyway, I don't feel too strongly about this approach as it might
> > introduce more serialization at higher layer.
> 
> Yes, I know it.
> 
> > > How about adding a callback function to the higher level controller?
> > > CFQ calls it when the active queue runs out of time, then the higher
> > > level controller uses it as a trigger or a hint to move the IO group, so
> > > I think a time-based controller could be implemented at higher level.
> > > 
> > 
> > Adding a callback should not be a big issue. But that means you are
> > planning to run only one group at the higher layer at one time, and I think
> > that's the problem, because then we are introducing serialization at the higher
> > layer. So for any higher level device mapper target which has multiple
> > physical disks under it, we might be underutilizing these even more and
> > take a big hit on overall throughput.
> > 
> > The whole point of doing proportional weight at the lower layer is optimal
> > usage of the system.
> 
> But I think that the higher level approach makes it easy to configure
> against striped software raid devices.

How does the higher level controller make configuration easier?

In the case of the lower level design, one just has to create cgroups and assign
weights to them. This minimum step will be required in the higher level
controller also. (Even if you get rid of the dm-ioband device setup step.)

> If one would like to
> combine some physical disks into one logical device like a dm-linear,
> I think one should map the IO controller on each physical device and
> combine them into one logical device.
> 

In fact this sounds like a more complicated step where one has to set up
one dm-ioband device on top of each physical device. But I am assuming
that this will go away once you move to a per-request-queue implementation.

I think it should be the same in principle as my initial implementation of the IO
controller on the request queue, and I stopped development on it because of the
FIFO dispatch.

So you seem to be suggesting that you will move dm-ioband to the request queue
so that the additional device setup step is gone. You will also enable
it to do a time based group policy, so that we don't run into issues on
seeky media, and will also enable dispatch from only one group at a time so
that we don't run into isolation issues and can do time accounting
accurately.

If yes, then that has the potential to solve the issue. At the higher layer one
can think of enabling size-of-IO/number-of-IO policies, both for proportional
BW and max BW types of control. At the lower level one can enable pure time
based control on seeky media.

I think this will still leave us with the issue of prio within a group, as group
control is separate and you will not be maintaining separate queues for
each process. Similarly you will also have issues with read vs write
ratios as the IO schedulers underneath change.

So I will be curious to see that implementation. 

> > > My requirements for IO controller are:
> > > - Implements a higher level controller, which is located at block
> > >   layer and bio is grabbed in generic_make_request().
> > 
> > How are you planning to handle the issue of buffered writes Andrew raised?
> 
> I think that it would be better to use the higher-level controller
> along with the memory controller and have limits on memory usage for each
> cgroup. And as Kamezawa-san said, having limits on dirty pages would
> be better, too.
> 

Ok. So if we plan to co-mount the memory controller with per-memory-group
dirty_ratio implemented, that can work with both the higher level as well as
the low level controller. Not sure if we also require some kind of a per
memory group flusher thread infrastructure to make sure the higher weight
group gets more work done.

> > > - Can work with any type of IO scheduler.
> > > - Can work with any type of block devices.
> > > - Support multiple policies: proportional weight, max rate, time
> > >   based, and so on.
> > > 
> > > The IO controller mini-summit will be held next week, and I'm
> > > looking forward to meeting you all and discussing the IO controller.
> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> > 
> > Is there a new version of dm-ioband now where you have solved the issue of
> > sync/async dispatch within a group? Before meeting at the mini-summit, I am
> > trying to run some tests and come up with numbers so that we have a
> > clearer picture of the pros/cons.
> 
> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> dm-ioband handles sync/async IO requests separately and
> the write-starve-read issue you pointed out is fixed. I would
> appreciate it if you would try them.
> http://sourceforge.net/projects/ioband/files/ 

Cool. Will get to testing it.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                         ` <20091005123148.GB22143-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-10-05 14:55                           ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-05 14:55 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,

Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> > Hi,
> > 
> > Munehiro Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org> wrote:
> > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > > going through the request based dm-multipath paper. Will it make sense
> > > > to implement request based dm-ioband? So basically we implement all the
> > > > group scheduling in CFQ and let dm-ioband implement a request function
> > > > to take the request and break it back into bios. This way we can keep
> > > > all the group control at one place and also meet most of the requirements.
> > > >
> > > > So request based dm-ioband will have a request in hand once that request
> > > > has passed group control and prio control. Because dm-ioband is a device
> > > > mapper target, one can put it on higher level devices (practically taking
> > > > CFQ at higher level device), and provide fairness there. One can also
> > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > > > them to use the IO scheduler.)
> > > >
> > > > I am sure there will be many issues, but one big issue I could think of is that
> > > > CFQ thinks that there is one device beneath it and dispatches requests
> > > > from one queue (in case of idling), and that would kill parallelism at the
> > > > higher layer and throughput will suffer on many of the dm/md configurations.
> > > >
> > > > Thanks
> > > > Vivek
> > > 
> > > As long as CFQ is used, your idea is reasonable to me.  But how about
> > > other IO schedulers?  In my understanding, one of the keys to guaranteeing
> > > group isolation in your patch is to have per-group IO scheduler internal
> > > queues even with the as, deadline, and noop schedulers.  I think this is a
> > > great idea, and implementing generic code for all IO schedulers was what we
> > > concluded when we had so many IO scheduler specific proposals.
> > > If we will still need per-group IO scheduler internal queues with
> > > request-based dm-ioband, we have to modify elevator layer.  It seems
> > > out of scope of dm.
> > > I might miss something...
> > 
> > IIUC, the request based device-mapper could not break back a request
> > into bio, so it could not work with block devices which don't use the
> > IO scheduler.
> > 
> 
> I think the current request based multipath driver does not do it, but can't it
> be implemented so that requests are broken back into bios?

I guess it would be hard to implement, as we would need to hold requests
and throttle them there, and it would break the ordering done by CFQ.

> Anyway, I don't feel too strongly about this approach as it might
> introduce more serialization at higher layer.

Yes, I know it.

> > How about adding a callback function to the higher level controller?
> > CFQ calls it when the active queue runs out of time, then the higher
> > level controller uses it as a trigger or a hint to move the IO group, so
> > I think a time-based controller could be implemented at higher level.
> > 
> 
> Adding a callback should not be a big issue. But that means you are
> planning to run only one group at the higher layer at one time, and I think
> that's the problem, because then we are introducing serialization at the higher
> layer. So for any higher level device mapper target which has multiple
> physical disks under it, we might be underutilizing these even more and
> take a big hit on overall throughput.
> 
> The whole point of doing proportional weight at the lower layer is optimal
> usage of the system.

But I think that the higher level approach makes it easy to configure
against striped software raid devices. If one would like to
combine some physical disks into one logical device like dm-linear,
I think one should map the IO controller onto each physical device and
combine them into one logical device.

> > My requirements for IO controller are:
> > - Implements a higher level controller, which is located at block
> >   layer and bio is grabbed in generic_make_request().
> 
> How are you planning to handle the issue of buffered writes Andrew raised?

I think that it would be better to use the higher-level controller
along with the memory controller and have limits on memory usage for each
cgroup. And as Kamezawa-san said, having limits on dirty pages would
be better, too.

> > - Can work with any type of IO scheduler.
> > - Can work with any type of block devices.
> > - Support multiple policies: proportional weight, max rate, time
> >   based, and so on.
> > 
> > The IO controller mini-summit will be held next week, and I'm
> > looking forward to meeting you all and discussing the IO controller.
> > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> 
> Is there a new version of dm-ioband now where you have solved the issue of
> sync/async dispatch within a group? Before meeting at the mini-summit, I am
> trying to run some tests and come up with numbers so that we have a
> clearer picture of the pros/cons.

Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
dm-ioband handles sync/async IO requests separately and
the write-starve-read issue you pointed out is fixed. I would
appreciate it if you would try them.
http://sourceforge.net/projects/ioband/files/ 

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-05 12:31                         ` Vivek Goyal
@ 2009-10-05 14:55                           ` Ryo Tsuruta
  -1 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-05 14:55 UTC (permalink / raw)
  To: vgoyal
  Cc: m-ikeda, nauman, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, agk, akpm, peterz, jmarchan, torvalds, mingo, riel,
	yoshikawa.takuya

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> > Hi,
> > 
> > Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
> > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > > going through the request based dm-multipath paper. Will it make sense
> > > > to implement request based dm-ioband? So basically we implement all the
> > > > group scheduling in CFQ and let dm-ioband implement a request function
> > > > to take the request and break it back into bios. This way we can keep
> > > > all the group control at one place and also meet most of the requirements.
> > > >
> > > > So request based dm-ioband will have a request in hand once that request
> > > > has passed group control and prio control. Because dm-ioband is a device
> > > > mapper target, one can put it on higher level devices (practically taking
> > > > CFQ at higher level device), and provide fairness there. One can also
> > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > > > them to use the IO scheduler.)
> > > >
> > > > I am sure that will be many issues but one big issue I could think of that
> > > > CFQ thinks that there is one device beneath it and dipsatches requests
> > > > from one queue (in case of idling) and that would kill parallelism at
> > > > higher layer and throughput will suffer on many of the dm/md configurations.
> > > >
> > > > Thanks
> > > > Vivek
> > > 
> > > As long as using CFQ, your idea is reasonable for me.  But how about for
> > > other IO schedulers?  In my understanding, one of the keys to guarantee
> > > group isolation in your patch is to have per-group IO scheduler internal
> > > queue even with as, deadline, and noop scheduler.  I think this is
> > > great idea, and to implement generic code for all IO schedulers was
> > > concluded when we had so many IO scheduler specific proposals.
> > > If we will still need per-group IO scheduler internal queues with
> > > request-based dm-ioband, we have to modify elevator layer.  It seems
> > > out of scope of dm.
> > > I might miss something...
> > 
> > IIUC, the request based device-mapper could not break back a request
> > into bio, so it could not work with block devices which don't use the
> > IO scheduler.
> > 
> 
> I think current request based multipath drvier does not do it but can't it
> be implemented that requests are broken back into bio?

I guess it would be hard to implement it, and we need to hold requests
and throttle them at there and it would break the ordering by CFQ.

> Anyway, I don't feel too strongly about this approach as it might
> introduce more serialization at higher layer.

Yes, I know it.

> > How about adding a callback function to the higher level controller?
> > CFQ calls it when the active queue runs out of time, then the higer
> > level controller use it as a trigger or a hint to move IO group, so
> > I think a time-based controller could be implemented at higher level.
> > 
> 
> Adding a call back should not be a big issue. But that means you are
> planning to run only one group at higher layer at one time and I think
> that's the problem because than we are introducing serialization at higher
> layer. So any higher level device mapper target which has multiple
> physical disks under it, we might be underutilizing these even more and
> take a big hit on overall throughput.
> 
> The whole design of doing proportional weight at lower layer is optimial 
> usage of system.

But I think that the higher level approch makes easy to configure
against striped software raid devices. If one would like to
combine some physical disks into one logical device like a dm-linear,
I think one should map the IO controller on each physical device and
combine them into one logical device.

> > My requirements for IO controller are:
> > - Implement s a higher level controller, which is located at block
> >   layer and bio is grabbed in generic_make_request().
> 
> How are you planning to handle the issue of buffered writes Andrew raised?

I think that it would be better to use the higher-level controller
along with the memory controller and have limits memory usage for each
cgroup. And as Kamezawa-san said, having limits of dirty pages would
be better, too.

> > - Can work with any type of IO scheduler.
> > - Can work with any type of block devices.
> > - Support multiple policies, proportional wegiht, max rate, time
> >   based, ans so on.
> > 
> > The IO controller mini-summit will be held in next week, and I'm
> > looking forard to meet you all and discuss about IO controller.
> > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> 
> Is there a new version of dm-ioband now where you have solved the issue of
> sync/async dispatch with-in group? Before meeting at mini-summit, I am
> trying to run some tests and come up with numbers so that we have more
> clear picture of pros/cons.

Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
dm-ioband handles sync/async IO requests separately and
the write-starve-read issue you pointed out is fixed. I would
appreciate it if you would try them.
http://sourceforge.net/projects/ioband/files/ 

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-10-05 14:55                           ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-05 14:55 UTC (permalink / raw)
  To: vgoyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew,
	yoshikawa.takuya, jmoyer, nauman, mingo, righi.andrea, riel,
	lizf, fchecconi, s-uchida, containers, linux-kernel, akpm,
	m-ikeda, torvalds

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> > Hi,
> > 
> > Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
> > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > > going through the request based dm-multipath paper. Will it make sense
> > > > to implement request based dm-ioband? So basically we implement all the
> > > > group scheduling in CFQ and let dm-ioband implement a request function
> > > > to take the request and break it back into bios. This way we can keep
> > > > all the group control at one place and also meet most of the requirements.
> > > >
> > > > So request based dm-ioband will have a request in hand once that request
> > > > has passed group control and prio control. Because dm-ioband is a device
> > > > mapper target, one can put it on higher level devices (practically taking
> > > > CFQ at higher level device), and provide fairness there. One can also
> > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > > > them to use the IO scheduler.)
> > > >
> > > > I am sure that will be many issues but one big issue I could think of that
> > > > CFQ thinks that there is one device beneath it and dipsatches requests
> > > > from one queue (in case of idling) and that would kill parallelism at
> > > > higher layer and throughput will suffer on many of the dm/md configurations.
> > > >
> > > > Thanks
> > > > Vivek
> > > 
> > > As long as using CFQ, your idea is reasonable for me.  But how about for
> > > other IO schedulers?  In my understanding, one of the keys to guarantee
> > > group isolation in your patch is to have per-group IO scheduler internal
> > > queue even with as, deadline, and noop scheduler.  I think this is
> > > great idea, and to implement generic code for all IO schedulers was
> > > concluded when we had so many IO scheduler specific proposals.
> > > If we will still need per-group IO scheduler internal queues with
> > > request-based dm-ioband, we have to modify elevator layer.  It seems
> > > out of scope of dm.
> > > I might miss something...
> > 
> > IIUC, the request based device-mapper could not break back a request
> > into bio, so it could not work with block devices which don't use the
> > IO scheduler.
> > 
> 
> I think current request based multipath drvier does not do it but can't it
> be implemented that requests are broken back into bio?

I guess it would be hard to implement it, and we need to hold requests
and throttle them at there and it would break the ordering by CFQ.

> Anyway, I don't feel too strongly about this approach as it might
> introduce more serialization at higher layer.

Yes, I know it.

> > How about adding a callback function to the higher level controller?
> > CFQ calls it when the active queue runs out of time, then the higher
> > level controller uses it as a trigger or a hint to move the IO group, so
> > I think a time-based controller could be implemented at a higher level.
> > 
> 
> Adding a callback should not be a big issue. But that means you are
> planning to run only one group at the higher layer at a time, and I think
> that's the problem, because then we are introducing serialization at the higher
> layer. So for any higher level device mapper target which has multiple
> physical disks under it, we might be underutilizing those disks even more and
> take a big hit on overall throughput.
> 
> The whole design of doing proportional weight at the lower layer is optimal
> usage of the system.

But I think that the higher level approach makes it easy to configure
against striped software RAID devices. If one would like to
combine some physical disks into one logical device like dm-linear,
I think one should map the IO controller onto each physical device and
combine them into one logical device.

> > My requirements for the IO controller are:
> > - Implemented as a higher level controller, which is located at the block
> >   layer, where bios are grabbed in generic_make_request().
> 
> How are you planning to handle the issue of buffered writes Andrew raised?

I think that it would be better to use the higher-level controller
along with the memory controller and have limits on memory usage for each
cgroup. And as Kamezawa-san said, having limits on dirty pages would
be better, too.

> > - Can work with any type of IO scheduler.
> > - Can work with any type of block device.
> > - Support multiple policies: proportional weight, max rate, time
> >   based, and so on.
> > 
> > The IO controller mini-summit will be held next week, and I'm
> > looking forward to meeting you all and discussing the IO controller.
> > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> 
> Is there a new version of dm-ioband now where you have solved the issue of
> sync/async dispatch within a group? Before meeting at the mini-summit, I am
> trying to run some tests and come up with numbers so that we have a
> clearer picture of the pros/cons.

Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
dm-ioband handles sync/async IO requests separately and
the write-starve-read issue you pointed out is fixed. I would
appreciate it if you would try them.
http://sourceforge.net/projects/ioband/files/ 

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                       ` <20091005.193808.104033719.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-10-05 12:31                         ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-05 12:31 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
> Hi,
> 
> Munehiro Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org> wrote:
> > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > > Before finishing this mail, will throw a whacky idea in the ring. I was
> > > going through the request based dm-multipath paper. Will it make sense
> > > to implement request based dm-ioband? So basically we implement all the
> > > group scheduling in CFQ and let dm-ioband implement a request function
> > > to take the request and break it back into bios. This way we can keep
> > > all the group control at one place and also meet most of the requirements.
> > >
> > > So request based dm-ioband will have a request in hand once that request
> > > has passed group control and prio control. Because dm-ioband is a device
> > > mapper target, one can put it on higher level devices (practically taking
> > > CFQ at higher level device), and provide fairness there. One can also
> > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > > them to use the IO scheduler.)
> > >
> > > I am sure there will be many issues, but one big issue I can think of is that
> > > CFQ thinks that there is one device beneath it and dispatches requests
> > > from one queue (in case of idling), and that would kill parallelism at
> > > the higher layer; throughput will suffer on many of the dm/md configurations.
> > >
> > > Thanks
> > > Vivek
> > 
> > As long as CFQ is used, your idea is reasonable to me.  But what about
> > other IO schedulers?  In my understanding, one of the keys to guaranteeing
> > group isolation in your patch is having a per-group IO scheduler internal
> > queue even with the as, deadline, and noop schedulers.  I think this is a
> > great idea, and implementing generic code for all IO schedulers was the
> > conclusion we reached when we had so many IO scheduler specific proposals.
> > If we will still need per-group IO scheduler internal queues with
> > request-based dm-ioband, we have to modify the elevator layer.  That seems
> > out of scope for dm.
> > I might be missing something...
> 
> IIUC, the request-based device-mapper could not break a request back
> into bios, so it could not work with block devices which don't use the
> IO scheduler.
> 

I think the current request based multipath driver does not do it, but can't it
be implemented so that requests are broken back into bios?

Anyway, I don't feel too strongly about this approach as it might
introduce more serialization at higher layer.

> How about adding a callback function to the higher level controller?
> CFQ calls it when the active queue runs out of time, then the higher
> level controller uses it as a trigger or a hint to move the IO group, so
> I think a time-based controller could be implemented at a higher level.
> 

Adding a callback should not be a big issue. But that means you are
planning to run only one group at the higher layer at a time, and I think
that's the problem, because then we are introducing serialization at the higher
layer. So for any higher level device mapper target which has multiple
physical disks under it, we might be underutilizing those disks even more and
take a big hit on overall throughput.

The whole design of doing proportional weight at the lower layer is optimal
usage of the system.
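To make the quoted callback proposal concrete, here is a minimal userspace sketch; all names, the structures, and the weighted-virtual-time policy are my own illustration under stated assumptions, not an existing kernel interface. The idea: the scheduler notifies the higher-level controller when the active queue's slice expires, and the controller picks the next group to service.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace sketch of the proposed hook (hypothetical names, not a
 * kernel API): CFQ would call hl_slice_expired() when the active
 * queue's time slice runs out, and the higher-level controller would
 * return the group to run next. */
struct io_group {
	int id;
	unsigned int weight;
	unsigned long vtime;	/* service received, scaled by weight */
};

struct hl_controller {
	struct io_group *groups;
	size_t ngroups;
};

/* Charge the just-expired group for the slice it used, then pick the
 * group with the smallest weighted virtual time, i.e. the one that has
 * received the least service relative to its weight. */
static struct io_group *hl_slice_expired(struct hl_controller *c,
					 struct io_group *done,
					 unsigned long slice_used)
{
	done->vtime += slice_used * 100 / done->weight;

	struct io_group *next = &c->groups[0];
	for (size_t i = 1; i < c->ngroups; i++)
		if (c->groups[i].vtime < next->vtime)
			next = &c->groups[i];
	return next;
}
```

Note that this sketch runs exactly one group at a time, which is precisely the serialization concern above: a dm/md target with many disks underneath would be driven by a single group's stream.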

> My requirements for the IO controller are:
> - Implemented as a higher level controller, which is located at the block
>   layer, where bios are grabbed in generic_make_request().

How are you planning to handle the issue of buffered writes Andrew raised?

> - Can work with any type of IO scheduler.
> - Can work with any type of block device.
> - Support multiple policies: proportional weight, max rate, time
>   based, and so on.
> 
> The IO controller mini-summit will be held next week, and I'm
> looking forward to meeting you all and discussing the IO controller.
> https://sourceforge.net/apps/trac/ioband/wiki/iosummit

Is there a new version of dm-ioband now where you have solved the issue of
sync/async dispatch within a group? Before meeting at the mini-summit, I am
trying to run some tests and come up with numbers so that we have a
clearer picture of the pros/cons.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                     ` <4AC6623F.70600-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
@ 2009-10-05 10:38                       ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-05 10:38 UTC (permalink / raw)
  To: m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi,

Munehiro Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org> wrote:
> Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> > Before finishing this mail, will throw a whacky idea in the ring. I was
> > going through the request based dm-multipath paper. Will it make sense
> > to implement request based dm-ioband? So basically we implement all the
> > group scheduling in CFQ and let dm-ioband implement a request function
> > to take the request and break it back into bios. This way we can keep
> > all the group control at one place and also meet most of the requirements.
> >
> > So request based dm-ioband will have a request in hand once that request
> > has passed group control and prio control. Because dm-ioband is a device
> > mapper target, one can put it on higher level devices (practically taking
> > CFQ at higher level device), and provide fairness there. One can also
> > put it on those SSDs which don't use IO scheduler (this is kind of forcing
> > them to use the IO scheduler.)
> >
> > I am sure there will be many issues, but one big issue I can think of is that
> > CFQ thinks that there is one device beneath it and dispatches requests
> > from one queue (in case of idling), and that would kill parallelism at
> > the higher layer; throughput will suffer on many of the dm/md configurations.
> >
> > Thanks
> > Vivek
> 
> As long as CFQ is used, your idea is reasonable to me.  But what about
> other IO schedulers?  In my understanding, one of the keys to guaranteeing
> group isolation in your patch is having a per-group IO scheduler internal
> queue even with the as, deadline, and noop schedulers.  I think this is a
> great idea, and implementing generic code for all IO schedulers was the
> conclusion we reached when we had so many IO scheduler specific proposals.
> If we will still need per-group IO scheduler internal queues with
> request-based dm-ioband, we have to modify the elevator layer.  That seems
> out of scope for dm.
> I might be missing something...

IIUC, the request-based device-mapper could not break a request back
into bios, so it could not work with block devices which don't use the
IO scheduler.

How about adding a callback function to the higher level controller?
CFQ calls it when the active queue runs out of time, then the higher
level controller uses it as a trigger or a hint to move the IO group, so
I think a time-based controller could be implemented at a higher level.

My requirements for the IO controller are:
- Implemented as a higher level controller, which is located at the block
  layer, where bios are grabbed in generic_make_request().
- Can work with any type of IO scheduler.
- Can work with any type of block device.
- Support multiple policies: proportional weight, max rate, time
  based, and so on.
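As a rough illustration of these requirements, here is a userspace model of grabbing bios at the generic_make_request() level and applying one of the listed policies (max rate). The names and the per-tick token scheme are hypothetical, my own sketch rather than dm-ioband's actual implementation:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical names, not kernel code): a controller sitting
 * at the block layer intercepts each bio before it reaches the IO
 * scheduler. Each group gets `rate` tokens per tick; dispatching a bio
 * costs one token, and a group that runs out is held until the next
 * tick. This is scheduler- and device-independent, matching the
 * "works with any IO scheduler / block device" requirement. */
struct bio_group {
	unsigned int rate;	/* bios allowed per tick */
	unsigned int tokens;	/* tokens left in the current tick */
};

/* Refill at the start of each tick. */
static void group_tick(struct bio_group *g)
{
	g->tokens = g->rate;
}

/* Called where generic_make_request() would hand us the bio.
 * Returns true if the bio may proceed now; false means the controller
 * holds it until the next tick. */
static bool grab_bio(struct bio_group *g)
{
	if (g->tokens == 0)
		return false;
	g->tokens--;
	return true;
}
```

A proportional-weight or time-based policy would slot into the same hook by changing only the accounting inside grab_bio().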

The IO controller mini-summit will be held next week, and I'm
looking forward to meeting you all and discussing the IO controller.
https://sourceforge.net/apps/trac/ioband/wiki/iosummit

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                               ` <4e5e476b0910030212y50f97d97nc2e17c35d855cc63-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-03 13:18                                                                 ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-03 13:18 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Sat, Oct 03 2009, Corrado Zoccolo wrote:
> Hi,
> On Sat, Oct 3, 2009 at 11:00 AM, Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org> wrote:
> > On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
> >
> >> After shutting down the computer yesterday, I was thinking a bit about
> >> this issue and how to solve it without incurring too much delay. If we
> >> add a stricter control of the depth, that may help. So instead of
> >> allowing up to max_quantum (or larger) depths, only allow gradual build
> >> up of that the farther we get away from a dispatch from the sync IO
> >> queues. For example, when switching to an async or seeky sync queue,
> >> initially allow just 1 in flight. For the next round, if there still
> >> hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
> >> again, immediately drop to 1.
> >>
> 
> I would limit just async I/O. Seeky sync queues are automatically
> throttled by being sync, and already have high latency, so we
> shouldn't increase it artificially. I think, instead, that we should
> send multiple seeky requests (possibly coming from different queues)
> at once. They will help especially with raid devices, where the seeks
> for requests going to different disks will happen in parallel.
> 
Async is the prime offender, definitely.

> >> It could tie in with (or partly replace) the overload feature. The key
> >> to good latency and decent throughput is knowing when to allow queue
> >> build up and when not to.
> >
> > Hm.  Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> > build/unleash any sizable IO, but that's just my gut talking.
> >
> On the other hand, sending 1 write first and then waiting for it to
> complete before submitting new ones will help perform more merges,
> so the subsequent requests will be bigger and thus more efficient.

Usually async writes stack up very quickly, so as long as you don't
drain completely, the merging will happen automagically anyway.
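The ramp-up scheme quoted above (allow 1 in flight, then 2, then 4 with each round that sees no sync activity, dropping straight back to 1 when sync IO appears) can be sketched as follows. This is my own userspace model of the idea, not actual CFQ code, and the cap value is an arbitrary assumption:

```c
#include <assert.h>

#define DEPTH_CAP 32	/* arbitrary upper bound for the sketch */

/* Gradual build-up of allowed async/seeky dispatch depth: start thin
 * to protect sync latency, double per quiet round for throughput, and
 * collapse immediately when sync IO is seen again. */
struct depth_ramp {
	unsigned int depth;	/* requests allowed in flight */
};

static void ramp_init(struct depth_ramp *r)
{
	r->depth = 1;
}

/* A round completed with no sync activity: allow deeper queueing. */
static void ramp_no_sync_round(struct depth_ramp *r)
{
	if (r->depth * 2 <= DEPTH_CAP)
		r->depth *= 2;
}

/* Sync IO observed: drop to minimum depth at once. */
static void ramp_sync_seen(struct depth_ramp *r)
{
	r->depth = 1;
}
```

Since async writes stack up quickly, even a depth of 1 between refills still leaves plenty of opportunity for merging, which is the point made above.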

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-03  9:12                                                               ` Corrado Zoccolo
@ 2009-10-03 13:18                                                                 ` Jens Axboe
  -1 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-03 13:18 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Mike Galbraith, Ingo Molnar, Linus Torvalds, Vivek Goyal,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, riel

On Sat, Oct 03 2009, Corrado Zoccolo wrote:
> Hi,
> On Sat, Oct 3, 2009 at 11:00 AM, Mike Galbraith <efault@gmx.de> wrote:
> > On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
> >
> >> After shutting down the computer yesterday, I was thinking a bit about
> >> this issue and how to solve it without incurring too much delay. If we
> >> add a stricter control of the depth, that may help. So instead of
> >> allowing up to max_quantum (or larger) depths, only allow gradual build
> >> up of that the farther we get away from a dispatch from the sync IO
> >> queues. For example, when switching to an async or seeky sync queue,
> >> initially allow just 1 in flight. For the next round, if there still
> >> hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
> >> again, immediately drop to 1.
> >>
> 
> I would limit just async I/O. Seeky sync queues are automatically
> throttled by being sync, and already have high latency, so we
> shouldn't increase it artificially. I think, instead, that we should
> send multiple seeky requests (possibly coming from different queues)
> at once. They will help especially with raid devices, where the seeks
> for requests going to different disks will happen in parallel.
> 
Async is the prime offender, definitely.

> >> It could tie in with (or partly replace) the overload feature. The key
> >> to good latency and decent throughput is knowing when to allow queue
> >> build up and when not to.
> >
> > Hm.  Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> > build/unleash any sizable IO, but that's just my gut talking.
> >
> On the other hand, sending 1 write first and then waiting for it to
> complete before submitting new ones will help perform more merges,
> so the subsequent requests will be bigger and thus more efficient.

Usually async writes stack up very quickly, so as long as you don't
drain completely, the merging will happen automagically anyway.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                             ` <1254560434.17052.14.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-10-03  9:12                                                               ` Corrado Zoccolo
@ 2009-10-03 13:17                                                               ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-03 13:17 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Sat, Oct 03 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
> 
> > After shutting down the computer yesterday, I was thinking a bit about
> > this issue and how to solve it without incurring too much delay. If we
> > add a stricter control of the depth, that may help. So instead of
> > allowing up to max_quantum (or larger) depths, only allow gradual build
> > up of that the farther we get away from a dispatch from the sync IO
> > queues. For example, when switching to an async or seeky sync queue,
> > initially allow just 1 in flight. For the next round, if there still
> > hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
> > again, immediately drop to 1.
> > 
> > It could tie in with (or partly replace) the overload feature. The key
> > to good latency and decent throughput is knowing when to allow queue
> > build up and when not to.
> 
> Hm.  Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> build/unleash any sizable IO, but that's just my gut talking.

Not sure, will need some testing of course. But it'll build up quickly.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                         ` <1254549378.8299.21.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-10-03  7:24                                                           ` Jens Axboe
@ 2009-10-03 11:29                                                           ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-03 11:29 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Sat, Oct 03, 2009 at 07:56:18AM +0200, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> > 
> > > If you could do a cleaned up version of your overload patch based on
> > > this:
> > > 
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > > 
> > > then lets take it from there.
> 

> Note to self: build the darn thing after last minute changes.
> 
> Block:  Delay overloading of CFQ queues to improve read latency.
> 
> Introduce a delay maximum dispatch timestamp, and stamp it when:
>         1. we encounter a known seeky or possibly new sync IO queue.
>         2. the current queue may go idle and we're draining async IO.
>         3. we have sync IO in flight and are servicing an async queue.
>         4. we are not the sole user of the disk.
> Disallow exceeding quantum if any of these events have occurred recently.
> 

So it looks like the issue is primarily that we have done a lot of
dispatch from the async queue, and if some sync queue comes in now, it
will experience latencies.

For an ongoing seeky sync queue the issue will be solved to some
extent, because previously we did not choose to idle for that queue and
now we will idle, hence the async queue will not get a chance to
overload the dispatch queue.

For the sync queues where we choose not to enable idling, we will still
see the latencies. Instead of timestamping on all of the above events,
can we just keep track of the last sync request completed in the system
and not allow the async queue to flood/overload the dispatch queue
within a certain time limit of that last sync request's completion?
This just gives a buffer period for that sync queue to come back and
submit more requests and still not suffer large latencies.

Thanks
Vivek


> Protect this behavioral change with a "desktop_dispatch" knob and default
> it to "on".. providing an easy means of regression verification prior to
> hate-mail dispatch :) to CC list.
> 
> Signed-off-by: Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org>
> Cc: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> Cc: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
> ... others who let somewhat hacky tweak slip by
> 
> ---
>  block/cfq-iosched.c |   45 +++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 41 insertions(+), 4 deletions(-)
> 
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -174,6 +174,9 @@ struct cfq_data {
>  	unsigned int cfq_slice_async_rq;
>  	unsigned int cfq_slice_idle;
>  	unsigned int cfq_desktop;
> +	unsigned int cfq_desktop_dispatch;
> +
> +	unsigned long desktop_dispatch_ts;
>  
>  	struct list_head cic_list;
>  
> @@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
>  	struct cfq_data *cfqd = q->elevator->elevator_data;
>  	struct cfq_queue *cfqq;
>  	unsigned int max_dispatch;
> +	unsigned long delay;
>  
>  	if (!cfqd->busy_queues)
>  		return 0;
> @@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
>  	/*
>  	 * Drain async requests before we start sync IO
>  	 */
> -	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> +	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
> +		cfqd->desktop_dispatch_ts = jiffies;
>  		return 0;
> +	}
>  
>  	/*
>  	 * If this is an async queue and we have sync IO in flight, let it wait
>  	 */
> -	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> +	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
> +		cfqd->desktop_dispatch_ts = jiffies;
>  		return 0;
> +	}
>  
>  	max_dispatch = cfqd->cfq_quantum;
>  	if (cfq_class_idle(cfqq))
>  		max_dispatch = 1;
>  
> +	if (cfqd->busy_queues > 1)
> +		cfqd->desktop_dispatch_ts = jiffies;
> +
>  	/*
>  	 * Does this cfqq already have too much IO in flight?
>  	 */
> @@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
>  			return 0;
>  
>  		/*
> +		 * Don't start overloading until we've been alone for a bit.
> +		 */
> +		if (cfqd->cfq_desktop_dispatch) {
> +			delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
> +
> +			if (time_before(jiffies, delay))
> +				return 0;
> +		}
> +
> +		/*
>  		 * we are the only queue, allow up to 4 times of 'quantum'
>  		 */
>  		if (cfqq->dispatched >= 4 * max_dispatch)
> @@ -1942,7 +1963,7 @@ static void
>  cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  		       struct cfq_io_context *cic)
>  {
> -	int old_idle, enable_idle;
> +	int old_idle, enable_idle, seeky = 0;
>  
>  	/*
>  	 * Don't idle for async or idle io prio class
> @@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
>  	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
>  		return;
>  
> +	if (cfqd->hw_tag) {
> +		if (CIC_SEEKY(cic))
> +			seeky = 1;
> +		/*
> +		 * If seeky or incalculable seekiness, delay overloading.
> +		 */
> +		if (seeky || !sample_valid(cic->seek_samples))
> +			cfqd->desktop_dispatch_ts = jiffies;
> +	}
> +
>  	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>  
>  	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> -	    (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
> +	    (!cfqd->cfq_desktop && seeky))
>  		enable_idle = 0;
>  	else if (sample_valid(cic->ttime_samples)) {
>  		if (cic->ttime_mean > cfqd->cfq_slice_idle)
> @@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
>  	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
>  	cfqd->cfq_slice_idle = cfq_slice_idle;
>  	cfqd->cfq_desktop = 1;
> +	cfqd->cfq_desktop_dispatch = 1;
> +
> +	cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
>  	cfqd->hw_tag = 1;
>  
>  	return cfqd;
> @@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
>  SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
>  SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
>  SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
> +SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
>  #undef SHOW_FUNCTION
>  
>  #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> @@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
>  STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
>  		UINT_MAX, 0);
>  STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
> +STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
>  #undef STORE_FUNCTION
>  
>  #define CFQ_ATTR(name) \
> @@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
>  	CFQ_ATTR(slice_async_rq),
>  	CFQ_ATTR(slice_idle),
>  	CFQ_ATTR(desktop),
> +	CFQ_ATTR(desktop_dispatch),
>  	__ATTR_NULL
>  };
>  
> 

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                             ` <1254560434.17052.14.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-03  9:12                                                               ` Corrado Zoccolo
  2009-10-03 13:17                                                               ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-03  9:12 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

Hi,
On Sat, Oct 3, 2009 at 11:00 AM, Mike Galbraith <efault@gmx.de> wrote:
> On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
>
>> After shutting down the computer yesterday, I was thinking a bit about
>> this issue and how to solve it without incurring too much delay. If we
>> add a stricter control of the depth, that may help. So instead of
>> allowing up to max_quantum (or larger) depths, only allow gradual build
>> up of that the farther we get away from a dispatch from the sync IO
>> queues. For example, when switching to an async or seeky sync queue,
>> initially allow just 1 in flight. For the next round, if there still
>> hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
>> again, immediately drop to 1.
>>

I would limit just async I/O. Seeky sync queues are automatically
throttled by being sync, and already have high latency, so we
shouldn't increase it artificially. I think, instead, that we should
send multiple seeky requests (possibly coming from different queues)
at once. This will help especially with RAID devices, where the seeks
for requests going to different disks will happen in parallel.

>> It could tie in with (or partly replace) the overload feature. The key
>> to good latency and decent throughput is knowing when to allow queue
>> build up and when not to.
>
> Hm.  Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> build/unleash any sizable IO, but that's just my gut talking.
>
On the other hand, sending 1 write first and waiting for it to
complete before submitting new ones will help perform more merges,
so the subsequent requests will be bigger and thus more efficient.

Corrado

>        -Mike
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                           ` <20091003072540.GW31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2009-10-03  8:53                                                             ` Mike Galbraith
@ 2009-10-03  9:01                                                             ` Corrado Zoccolo
  1 sibling, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-03  9:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

Hi Jens,
On Sat, Oct 3, 2009 at 9:25 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> On Sat, Oct 03 2009, Ingo Molnar wrote:
>>
>> * Mike Galbraith <efault@gmx.de> wrote:
>>
>> >     unsigned int cfq_desktop;
>> > +   unsigned int cfq_desktop_dispatch;
>>
>> > -   if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
>> > +   if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
>> > +           cfqd->desktop_dispatch_ts = jiffies;
>> >             return 0;
>> > +   }
>>
>> btw., i hope all those desktop_ things will be named latency_ pretty
>> soon as the consensus seems to be - the word 'desktop' feels so wrong in
>> this context.
>>
>> 'desktop' is a form of use of computers and the implication of good
>> latencies goes far beyond that category of systems.
>
> I will rename it, for now it doesn't matter (lets not get bogged down in
> bike shed colors, please).
>
> Oh and Mike, I forgot to mention this in the previous email - no more
> tunables, please. We'll keep this under a single knob.

Did you have a look at my http://patchwork.kernel.org/patch/47750/ ?
It already introduces a 'target_latency' tunable, expressed in ms.

If we can quantify the benefits of each technique, we could enable
them based on the target latency requested by that single tunable.

Corrado

>
> --
> Jens Axboe
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                           ` <20091003072401.GV31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-03  9:00                                                             ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-03  9:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:

> After shutting down the computer yesterday, I was thinking a bit about
> this issue and how to solve it without incurring too much delay. If we
> add a stricter control of the depth, that may help. So instead of
> allowing up to max_quantum (or larger) depths, only allow gradual build
> up of that the farther we get away from a dispatch from the sync IO
> queues. For example, when switching to an async or seeky sync queue,
> initially allow just 1 in flight. For the next round, if there still
> hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
> again, immediately drop to 1.
> 
> It could tie in with (or partly replace) the overload feature. The key
> to good latency and decent throughput is knowing when to allow queue
> build up and when not to.

Hm.  Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
build/unleash any sizable IO, but that's just my gut talking.

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                           ` <20091003072540.GW31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-03  8:53                                                             ` Mike Galbraith
  2009-10-03  9:01                                                             ` Corrado Zoccolo
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-03  8:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Sat, 2009-10-03 at 09:25 +0200, Jens Axboe wrote:
> On Sat, Oct 03 2009, Ingo Molnar wrote:

> Oh and Mike, I forgot to mention this in the previous email - no more
> tunables, please. We'll keep this under a single knob.

OK.

Since I don't seem to be competent to operate quilt this morning anyway,
I won't send a fixed version yet.  Anyone who wants to test can easily
fix the rename booboo.  With the knob in place, it's easier to see what
load is affected by what change.

Back to rummage/test.

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                         ` <20091003072021.GB21407-X9Un+BFzKDI@public.gmane.org>
@ 2009-10-03  7:25                                                           ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-03  7:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Sat, Oct 03 2009, Ingo Molnar wrote:
> 
> * Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org> wrote:
> 
> >  	unsigned int cfq_desktop;
> > +	unsigned int cfq_desktop_dispatch;
> 
> > -	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> > +	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
> > +		cfqd->desktop_dispatch_ts = jiffies;
> >  		return 0;
> > +	}
> 
> btw., i hope all those desktop_ things will be named latency_ pretty 
> soon as the consensus seems to be - the word 'desktop' feels so wrong in 
> this context.
> 
> 'desktop' is a form of use of computers and the implication of good 
> latencies goes far beyond that category of systems.

I will rename it, for now it doesn't matter (lets not get bogged down in
bike shed colors, please).

Oh and Mike, I forgot to mention this in the previous email - no more
tunables, please. We'll keep this under a single knob.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                         ` <1254549378.8299.21.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-03  7:24                                                           ` Jens Axboe
  2009-10-03 11:29                                                           ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-03  7:24 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Sat, Oct 03 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> > 
> > > If you could do a cleaned up version of your overload patch based on
> > > this:
> > > 
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > > 
> > > then lets take it from there.
> 
> Note to self: build the darn thing after last minute changes.
> 
> Block:  Delay overloading of CFQ queues to improve read latency.
> 
> Introduce a delay maximum dispatch timestamp, and stamp it when:
>         1. we encounter a known seeky or possibly new sync IO queue.
>         2. the current queue may go idle and we're draining async IO.
>         3. we have sync IO in flight and are servicing an async queue.
>         4  we are not the sole user of disk.
> Disallow exceeding quantum if any of these events have occurred recently.
> 
> Protect this behavioral change with a "desktop_dispatch" knob and default
> it to "on".. providing an easy means of regression verification prior to
> hate-mail dispatch :) to CC list.

It still doesn't build:

block/cfq-iosched.c: In function 'cfq_dispatch_requests':
block/cfq-iosched.c:1345: error: 'max_delay' undeclared (first use in
this function)

After shutting down the computer yesterday, I was thinking a bit about
this issue and how to solve it without incurring too much delay. If we
add stricter control of the depth, that may help. So instead of
allowing up to max_quantum (or larger) depths, only allow a gradual
build-up of that depth the farther we get from a dispatch of the sync
IO queues. For example, when switching to an async or seeky sync queue,
initially allow just 1 request in flight. For the next round, if there
still hasn't been sync activity, allow 2, then 4, etc. If we see a sync
IO queue again, immediately drop back to 1.

It could tie in with (or partly replace) the overload feature. The key
to good latency and decent throughput is knowing when to allow queue
build up and when not to.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]                                                       ` <1254548931.8299.18.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-10-03  5:56                                                         ` Mike Galbraith
@ 2009-10-03  7:20                                                         ` Ingo Molnar
  1 sibling, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-03  7:20 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds


* Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org> wrote:

>  	unsigned int cfq_desktop;
> +	unsigned int cfq_desktop_dispatch;

> -	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> +	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
> +		cfqd->desktop_dispatch_ts = jiffies;
>  		return 0;
> +	}

btw., i hope all those desktop_ things will be named latency_ pretty
soon, as the consensus seems to be - the word 'desktop' feels so wrong
in this context.

'desktop' is one way of using computers, and the implication of good
latencies goes far beyond that category of systems.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]                                                       ` <1254548931.8299.18.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-03  5:56                                                         ` Mike Galbraith
  2009-10-03  7:20                                                         ` Ingo Molnar
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-03  5:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> 
> > If you could do a cleaned up version of your overload patch based on
> > this:
> > 
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > 
> > then let's take it from there.

Note to self: build the darn thing after last minute changes.

Block:  Delay overloading of CFQ queues to improve read latency.

Introduce a maximum-dispatch delay timestamp, and stamp it when:
        1. we encounter a known seeky or possibly new sync IO queue.
        2. the current queue may go idle and we're draining async IO.
        3. we have sync IO in flight and are servicing an async queue.
        4. we are not the sole user of the disk.
Disallow exceeding quantum if any of these events have occurred recently.

Protect this behavioral change with a "desktop_dispatch" knob and default
it to "on", providing an easy means of regression verification prior to
hate-mail dispatch :) to CC list.

Signed-off-by: Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org>
Cc: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
... others who let somewhat hacky tweak slip by

---
 block/cfq-iosched.c |   45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -174,6 +174,9 @@ struct cfq_data {
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 	unsigned int cfq_desktop;
+	unsigned int cfq_desktop_dispatch;
+
+	unsigned long desktop_dispatch_ts;
 
 	struct list_head cic_list;
 
@@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
+	unsigned long delay;
 
 	if (!cfqd->busy_queues)
 		return 0;
@@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
 	/*
 	 * Drain async requests before we start sync IO
 	 */
-	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
-	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	max_dispatch = cfqd->cfq_quantum;
 	if (cfq_class_idle(cfqq))
 		max_dispatch = 1;
 
+	if (cfqd->busy_queues > 1)
+		cfqd->desktop_dispatch_ts = jiffies;
+
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
@@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
 			return 0;
 
 		/*
+		 * Don't start overloading until we've been alone for a bit.
+		 */
+		if (cfqd->cfq_desktop_dispatch) {
+			delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
+
+			if (time_before(jiffies, max_delay))
+				return 0;
+		}
+
+		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
 		if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1942,7 +1963,7 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct cfq_io_context *cic)
 {
-	int old_idle, enable_idle;
+	int old_idle, enable_idle, seeky = 0;
 
 	/*
 	 * Don't idle for async or idle io prio class
@@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
+	if (cfqd->hw_tag) {
+		if (CIC_SEEKY(cic))
+			seeky = 1;
+		/*
+		 * If seeky or incalculable seekiness, delay overloading.
+		 */
+		if (seeky || !sample_valid(cic->seek_samples))
+			cfqd->desktop_dispatch_ts = jiffies;
+	}
+
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
+	    (!cfqd->cfq_desktop && seeky))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->cfq_desktop = 1;
+	cfqd->cfq_desktop_dispatch = 1;
+
+	cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
+SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
+STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
 	CFQ_ATTR(desktop),
+	CFQ_ATTR(desktop_dispatch),
 	__ATTR_NULL
 };

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]                                                     ` <20091002181903.GN31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2009-10-02 18:57                                                       ` Mike Galbraith
@ 2009-10-03  5:48                                                       ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-03  5:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:

> If you could do a cleaned up version of your overload patch based on
> this:
> 
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> 
> then let's take it from there.

If "take it from there" ends up meaning apply and see who squeaks, feel
free to delete the "Not", and my somewhat defective sense of humor.

Block:  Delay overloading of CFQ queues to improve read latency.

Introduce a maximum-dispatch delay timestamp, and stamp it when:
	1. we encounter a known seeky or possibly new sync IO queue.
	2. the current queue may go idle and we're draining async IO.
	3. we have sync IO in flight and are servicing an async queue.
	4. we are not the sole user of the disk.
Disallow exceeding quantum if any of these events have occurred recently.

Protect this behavioral change with a "desktop_dispatch" knob and default
it to "on", providing an easy means of regression verification prior to
hate-mail dispatch :) to CC list.

Signed-off-by: Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org>
Cc: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
... others who let somewhat hacky tweak slip by

LKML-Reference: <new-submission>

---
 block/cfq-iosched.c |   45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -174,6 +174,9 @@ struct cfq_data {
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 	unsigned int cfq_desktop;
+	unsigned int cfq_desktop_dispatch;
+
+	unsigned long desktop_dispatch_ts;
 
 	struct list_head cic_list;
 
@@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
+	unsigned long delay;
 
 	if (!cfqd->busy_queues)
 		return 0;
@@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
 	/*
 	 * Drain async requests before we start sync IO
 	 */
-	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
-	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	max_dispatch = cfqd->cfq_quantum;
 	if (cfq_class_idle(cfqq))
 		max_dispatch = 1;
 
+	if (cfqd->busy_queues > 1)
+		cfqd->desktop_dispatch_ts = jiffies;
+
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
@@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
 			return 0;
 
 		/*
+		 * Don't start overloading until we've been alone for a bit.
+		 */
+		if (cfqd->cfq_desktop_dispatch) {
+			delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
+
+			if (time_before(jiffies, max_delay))
+				return 0;
+		}
+
+		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
 		if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1942,7 +1963,7 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct cfq_io_context *cic)
 {
-	int old_idle, enable_idle;
+	int old_idle, enable_idle, seeky = 0;
 
 	/*
 	 * Don't idle for async or idle io prio class
@@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
+	if (cfqd->hw_tag) {
+		if (CIC_SEEKY(cic))
+			seeky = 1;
+		/*
+		 * If seeky or incalculable seekiness, delay overloading.
+		 */
+		if (seeky || !sample_valid(cic->seek_samples))
+			cfqd->desktop_dispatch_ts = jiffies;
+	}
+
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
+	    (!cfqd->cfq_desktop && seeky))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->cfq_desktop = 1;
+	cfqd->cfq_desktop_dispatch = 1;
+
+	cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
+SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
+STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
 	CFQ_ATTR(desktop),
+	CFQ_ATTR(desktop_dispatch),
 	__ATTR_NULL
 };

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02 18:19                                                   ` Jens Axboe
  2009-10-02 18:57                                                     ` Mike Galbraith
@ 2009-10-03  5:48                                                     ` Mike Galbraith
  2009-10-03  5:56                                                       ` Mike Galbraith
                                                                         ` (2 more replies)
       [not found]                                                     ` <20091002181903.GN31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2 siblings, 3 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-03  5:48 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:

> If you could do a cleaned up version of your overload patch based on
> this:
> 
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> 
> then lets take it from there.

If take it from there ends up meaning apply, and see who squeaks, feel
free to delete the "Not", and my somewhat defective sense of humor.

Block:  Delay overloading of CFQ queues to improve read latency.

Introduce a delay maximum dispatch timestamp, and stamp it when:
	1. we encounter a known seeky or possibly new sync IO queue.
	2. the current queue may go idle and we're draining async IO.
	3. we have sync IO in flight and are servicing an async queue.
	4  we are not the sole user of disk.
Disallow exceeding quantum if any of these events have occurred recently.

Protect this behavioral change with a "desktop_dispatch" knob and default
it to "on".. providing an easy means of regression verification prior to
hate-mail dispatch :) to CC list.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
... others who let somewhat hacky tweak slip by

LKML-Reference: <new-submission>

---
 block/cfq-iosched.c |   45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -174,6 +174,9 @@ struct cfq_data {
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 	unsigned int cfq_desktop;
+	unsigned int cfq_desktop_dispatch;
+
+	unsigned long desktop_dispatch_ts;
 
 	struct list_head cic_list;
 
@@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
+	unsigned long delay;
 
 	if (!cfqd->busy_queues)
 		return 0;
@@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
 	/*
 	 * Drain async requests before we start sync IO
 	 */
-	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
-	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	max_dispatch = cfqd->cfq_quantum;
 	if (cfq_class_idle(cfqq))
 		max_dispatch = 1;
 
+	if (cfqd->busy_queues > 1)
+		cfqd->desktop_dispatch_ts = jiffies;
+
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
@@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
 			return 0;
 
 		/*
+		 * Don't start overloading until we've been alone for a bit.
+		 */
+		if (cfqd->cfq_desktop_dispatch) {
+			delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
+
+			if (time_before(jiffies, delay))
+				return 0;
+		}
+
+		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
 		if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1942,7 +1963,7 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct cfq_io_context *cic)
 {
-	int old_idle, enable_idle;
+	int old_idle, enable_idle, seeky = 0;
 
 	/*
 	 * Don't idle for async or idle io prio class
@@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
+	if (cfqd->hw_tag) {
+		if (CIC_SEEKY(cic))
+			seeky = 1;
+		/*
+		 * If seeky or incalculable seekiness, delay overloading.
+		 */
+		if (seeky || !sample_valid(cic->seek_samples))
+			cfqd->desktop_dispatch_ts = jiffies;
+	}
+
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
+	    (!cfqd->cfq_desktop && seeky))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->cfq_desktop = 1;
+	cfqd->cfq_desktop_dispatch = 1;
+
+	cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
+SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
+STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
 	CFQ_ATTR(desktop),
+	CFQ_ATTR(desktop_dispatch),
 	__ATTR_NULL
 };
 



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                       ` <1254509838.8667.30.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-02 20:47                                                         ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 20:47 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Fri, 2009-10-02 at 20:57 +0200, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> 
> > I'm not too worried about the "single IO producer" scenarios, and it
> > looks like (from a quick look) that most of your numbers are within some
> > expected noise levels. It's the more complex mixes that are likely to
> > cause a bit of a stink, but let's worry about that later. One quick thing
> > would be to read eg 2 or more files sequentially from disk and see how
> > that performs.
> 
> Hm.  git(s) should be good for a nice repeatable load.  Suggestions?
> 
> > If you could do a cleaned up version of your overload patch based on
> > this:
> > 
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > 
> > then let's take it from there.
> 
> I'll try to find a good repeatable git beater first.  At this point, I
> only know it helps with one load.

Seems to help mixed concurrent read/write a bit too.

perf stat testo.sh                               Avg
108.12   106.33    106.34    97.00    106.52   104.8  1.000 fairness=0 overload_delay=0
 93.98   102.44     94.47    97.70     98.90    97.4   .929 fairness=0 overload_delay=1
 90.87    95.40     95.79    93.09     94.25    93.8   .895 fairness=1 overload_delay=0
 89.93    90.57     89.13    93.43     93.72    91.3   .871 fairness=1 overload_delay=1

#!/bin/sh

LOGFILE=testo.log
rm -f $LOGFILE

echo 3 > /proc/sys/vm/drop_caches
sh -c "(cd linux-2.6.23; perf stat -- git checkout -f; git archive --format=tar HEAD > ../linux-2.6.23.tar)" 2>&1|tee -a $LOGFILE &
sh -c "(cd linux-2.6.24; perf stat -- git archive --format=tar HEAD > ../linux-2.6.24.tar; git checkout -f)" 2>&1|tee -a $LOGFILE &
sh -c "(cd linux-2.6.25; perf stat -- git checkout -f; git archive --format=tar HEAD > ../linux-2.6.25.tar)" 2>&1|tee -a $LOGFILE &
sh -c "(cd linux-2.6.26; perf stat -- git archive --format=tar HEAD > ../linux-2.6.26.tar; git checkout -f)" 2>&1|tee -a $LOGFILE &
wait



* Re: IO scheduler based IO controller V10
       [not found]                   ` <20091002025731.GA2738-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-10-02 20:27                     ` Munehiro Ikeda
  0 siblings, 0 replies; 349+ messages in thread
From: Munehiro Ikeda @ 2009-10-02 20:27 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Vivek Goyal wrote, on 10/01/2009 10:57 PM:
> Before finishing this mail, will throw a whacky idea in the ring. I was
> going through the request based dm-multipath paper. Will it make sense
> to implement request based dm-ioband? So basically we implement all the
> group scheduling in CFQ and let dm-ioband implement a request function
> to take the request and break it back into bios. This way we can keep
> all the group control at one place and also meet most of the requirements.
>
> So request based dm-ioband will have a request in hand once that request
> has passed group control and prio control. Because dm-ioband is a device
> mapper target, one can put it on higher level devices (practically taking
> CFQ at higher level device), and provide fairness there. One can also
> put it on those SSDs which don't use IO scheduler (this is kind of forcing
> them to use the IO scheduler.)
>
> I am sure there will be many issues, but one big issue I can think of is that
> CFQ thinks there is one device beneath it and dispatches requests
> from one queue (in case of idling) and that would kill parallelism at
> higher layer and throughput will suffer on many of the dm/md configurations.
>
> Thanks
> Vivek

As long as CFQ is used, your idea sounds reasonable to me.  But what
about the other IO schedulers?  In my understanding, one of the keys to
guaranteeing group isolation in your patch is having a per-group IO
scheduler internal queue even with the as, deadline, and noop
schedulers.  I think this is a great idea, and implementing generic
code for all IO schedulers was the conclusion we reached after so many
IO scheduler specific proposals.
If we still need per-group IO scheduler internal queues with
request-based dm-ioband, we would have to modify the elevator layer.
That seems out of scope for dm.
I might be missing something...



-- 
IKEDA, Munehiro
   NEC Corporation of America
     m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org



* Re: IO scheduler based IO controller V10
       [not found]                                                           ` <20091002190110.GA25297-X9Un+BFzKDI@public.gmane.org>
@ 2009-10-02 19:09                                                             ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 19:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Theodore Tso,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Fri, Oct 02 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > On Fri, Oct 02 2009, Theodore Tso wrote:
> > > On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > > > > i'd say 'latency' describes it even better. 'interactivity' as a term is 
> > > > > a bit overladen.
> > > > 
> > > > I'm not too crazy about it either. How about just using 'desktop' 
> > > > since this is obviously what we are really targetting? 'latency' 
> > > > isn't fully descriptive either, since it may not necessarily 
> > > > provide the best single IO latency (noop would).
> > > 
> > > As Linus has already pointed out, it's not necessarily "desktop" 
> > > versus "server".  There will be certain high frequency transaction 
> > > database workloads (for example) that will very much care about 
> > > latency.  I think "low_latency" may be the best term to use.
> > 
> > Not necessarily, but typically it will be. As already noted, I don't 
> > think latency itself is a very descriptive term for this.
> 
> Why not? Nobody will think of 'latency' as something that requires noop, 
> but as something that in practice achieves low latencies, for stuff that 
> people use.

Alright, I'll acknowledge that if that's the general consensus. I may be
somewhat biased myself.

-- 
Jens Axboe



* Re: IO scheduler based IO controller V10
       [not found]                                                         ` <20091002184549.GS31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-02 19:01                                                           ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02 19:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Theodore Tso,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds


* Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:

> On Fri, Oct 02 2009, Theodore Tso wrote:
> > On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > > > i'd say 'latency' describes it even better. 'interactivity' as a term is 
> > > > a bit overladen.
> > > 
> > > I'm not too crazy about it either. How about just using 'desktop' 
> > since this is obviously what we are really targeting? 'latency' 
> > > isn't fully descriptive either, since it may not necessarily 
> > > provide the best single IO latency (noop would).
> > 
> > As Linus has already pointed out, it's not necessarily "desktop" 
> > versus "server".  There will be certain high frequency transaction 
> > database workloads (for example) that will very much care about 
> > latency.  I think "low_latency" may be the best term to use.
> 
> Not necessarily, but typically it will be. As already noted, I don't 
> think latency itself is a very descriptive term for this.

Why not? Nobody will think of 'latency' as something that requires noop, 
but as something that in practice achieves low latencies, for stuff that 
people use.

	Ingo



* Re: IO scheduler based IO controller V10
       [not found]                                                     ` <20091002181903.GN31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-02 18:57                                                       ` Mike Galbraith
  2009-10-03  5:48                                                       ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 18:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:

> I'm not too worried about the "single IO producer" scenarios, and it
> looks like (from a quick look) that most of your numbers are within some
> expected noise levels. It's the more complex mixes that are likely to
> cause a bit of a stink, but let's worry about that later. One quick thing
> would be to read eg 2 or more files sequentially from disk and see how
> that performs.

Hm.  git(s) should be good for a nice repeatable load.  Suggestions?

> If you could do a cleaned up version of your overload patch based on
> this:
> 
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> 
> then let's take it from there.

I'll try to find a good repeatable git beater first.  At this point, I
only know it helps with one load.

	-Mike



* Re: IO scheduler based IO controller V10
  2009-10-02 18:36                                                     ` Theodore Tso
@ 2009-10-02 18:45                                                         ` Jens Axboe
       [not found]                                                       ` <20091002183649.GE8161-3s7WtUTddSA@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 18:45 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Linus Torvalds, Mike Galbraith, Vivek Goyal,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, riel

On Fri, Oct 02 2009, Theodore Tso wrote:
> On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > > i'd say 'latency' describes it even better. 'interactivity' as a term is 
> > > a bit overladen.
> > 
> > I'm not too crazy about it either. How about just using 'desktop' since
> > this is obviously what we are really targeting? 'latency' isn't fully
> > descriptive either, since it may not necessarily provide the best single
> > IO latency (noop would).
> 
> As Linus has already pointed out, it's not necessarily "desktop"
> versus "server".  There will be certain high frequency transaction
> database workloads (for example) that will very much care about
> latency.  I think "low_latency" may be the best term to use.

Not necessarily, but typically it will be. As already noted, I don't
think latency itself is a very descriptive term for this.

-- 
Jens Axboe



* Re: IO scheduler based IO controller V10
  2009-10-02 18:04                                                   ` Jens Axboe
       [not found]                                                     ` <20091002180437.GL31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2009-10-02 18:22                                                     ` Mike Galbraith
@ 2009-10-02 18:36                                                     ` Theodore Tso
  2009-10-02 18:45                                                         ` Jens Axboe
       [not found]                                                       ` <20091002183649.GE8161-3s7WtUTddSA@public.gmane.org>
  2 siblings, 2 replies; 349+ messages in thread
From: Theodore Tso @ 2009-10-02 18:36 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Linus Torvalds, Mike Galbraith, Vivek Goyal,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, riel

On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > i'd say 'latency' describes it even better. 'interactivity' as a term is 
> > a bit overladen.
> 
> I'm not too crazy about it either. How about just using 'desktop' since
> > this is obviously what we are really targeting? 'latency' isn't fully
> descriptive either, since it may not necessarily provide the best single
> IO latency (noop would).

As Linus has already pointed out, it's not necessarily "desktop"
versus "server".  There will be certain high frequency transaction
database workloads (for example) that will very much care about
latency.  I think "low_latency" may be the best term to use.

	    	  		       	   - Ted


* Re: IO scheduler based IO controller V10
  2009-10-02 18:29                     ` Mike Galbraith
@ 2009-10-02 18:36                       ` Jens Axboe
       [not found]                       ` <1254508197.8667.22.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 18:36 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 20:08 +0200, Jens Axboe wrote:
> > On Thu, Oct 01 2009, Mike Galbraith wrote:
> > >  	max_dispatch = cfqd->cfq_quantum;
> > >  	if (cfq_class_idle(cfqq))
> > >  		max_dispatch = 1;
> > >  
> > > +	if (cfqd->busy_queues > 1)
> > > +		cfqd->od_stamp = jiffies;
> > > +
> > 
> > ->busy_queues > 1 just means that they have requests ready for dispatch,
> > not that they are dispatched.
> 
> But we're not alone, somebody else is using disk.  I'm trying to make
> sure we don't have someone _about_ to come back.. like a reader, so when
> there's another player, stamp to give him some time to wake up/submit
> before putting the pedal to the metal.

OK, then the check does what you want. It'll tell you that you have a
pending request, and at least one other queue has one too. And that
could dispatch right after you finish yours, depending on idling etc.
Note that this _only_ applies to queues that have requests still sitting
in CFQ; as soon as they are on the dispatch list in the block layer, they
will only be counted as busy if they still have sorted IO waiting.

But that should be OK already, since I switched CFQ to dispatch single
requests a few revisions ago. So we should not run into that anymore.

-- 
Jens Axboe



* Re: IO scheduler based IO controller V10
  2009-10-02 18:26                                                       ` Jens Axboe
@ 2009-10-02 18:33                                                         ` Mike Galbraith
       [not found]                                                         ` <20091002182608.GO31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 18:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, 2009-10-02 at 20:26 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:04 +0200, Jens Axboe wrote:
> > 
> > > I'm not too crazy about it either. How about just using 'desktop' since
> > > this is obviously what we are really targetting? 'latency' isn't fully
> > > descriptive either, since it may not necessarily provide the best single
> > > IO latency (noop would).
> > 
> > Grin. "Perfect is the enemy of good" :)
> >                                                   Avg
> >      16.24   175.82   154.38   228.97   147.16  144.5     noop
> >      43.23    57.39    96.13   148.25   180.09  105.0     deadline
> 
> Yep, that's where it falls down. Noop basically fails here because it
> treats all IO as equal, which obviously isn't true for most people. But
> even for pure read workloads (is the above the mixed read/write, or just
> read?), latency would be excellent with noop but the desktop experience
> would not.

Yeah, it's the dd vs konsole -e exit.

	-Mike



* Re: IO scheduler based IO controller V10
  2009-10-02 18:08                   ` Jens Axboe
@ 2009-10-02 18:29                     ` Mike Galbraith
  2009-10-02 18:36                       ` Jens Axboe
       [not found]                       ` <1254508197.8667.22.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
       [not found]                     ` <20091002180857.GM31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  1 sibling, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 18:29 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Fri, 2009-10-02 at 20:08 +0200, Jens Axboe wrote:
> On Thu, Oct 01 2009, Mike Galbraith wrote:
> >  	max_dispatch = cfqd->cfq_quantum;
> >  	if (cfq_class_idle(cfqq))
> >  		max_dispatch = 1;
> >  
> > +	if (cfqd->busy_queues > 1)
> > +		cfqd->od_stamp = jiffies;
> > +
> 
> ->busy_queues > 1 just means that they have requests ready for dispatch,
> not that they are dispatched.

But we're not alone, somebody else is using disk.  I'm trying to make
sure we don't have someone _about_ to come back.. like a reader, so when
there's another player, stamp to give him some time to wake up/submit
before putting the pedal to the metal.

	-Mike



* Re: IO scheduler based IO controller V10
  2009-10-02 18:22                                                     ` Mike Galbraith
@ 2009-10-02 18:26                                                       ` Jens Axboe
  2009-10-02 18:33                                                         ` Mike Galbraith
       [not found]                                                         ` <20091002182608.GO31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
       [not found]                                                       ` <1254507754.8667.15.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  1 sibling, 2 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 18:26 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 20:04 +0200, Jens Axboe wrote:
> 
> > I'm not too crazy about it either. How about just using 'desktop' since
> > this is obviously what we are really targetting? 'latency' isn't fully
> > descriptive either, since it may not necessarily provide the best single
> > IO latency (noop would).
> 
> Grin. "Perfect is the enemy of good" :)
>                                                   Avg
>      16.24   175.82   154.38   228.97   147.16  144.5     noop
>      43.23    57.39    96.13   148.25   180.09  105.0     deadline

Yep, that's where it falls down. Noop basically fails here because it
treats all IO as equal, which obviously isn't true for most people. But
even for pure read workloads (is the above the mixed read/write, or just
read?), latency would be excellent with noop but the desktop experience
would not.

-- 
Jens Axboe



* Re: IO scheduler based IO controller V10
  2009-10-02 18:04                                                   ` Jens Axboe
       [not found]                                                     ` <20091002180437.GL31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-02 18:22                                                     ` Mike Galbraith
  2009-10-02 18:26                                                       ` Jens Axboe
       [not found]                                                       ` <1254507754.8667.15.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-10-02 18:36                                                     ` Theodore Tso
  2 siblings, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 18:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, 2009-10-02 at 20:04 +0200, Jens Axboe wrote:

> I'm not too crazy about it either. How about just using 'desktop' since
> this is obviously what we are really targeting? 'latency' isn't fully
> descriptive either, since it may not necessarily provide the best single
> IO latency (noop would).

Grin. "Perfect is the enemy of good" :)
                                                  Avg
     16.24   175.82   154.38   228.97   147.16  144.5     noop
     43.23    57.39    96.13   148.25   180.09  105.0     deadline




* Re: IO scheduler based IO controller V10
  2009-10-02 18:13                                                 ` Mike Galbraith
@ 2009-10-02 18:19                                                   ` Jens Axboe
  2009-10-02 18:57                                                     ` Mike Galbraith
                                                                       ` (2 more replies)
       [not found]                                                   ` <1254507215.8667.7.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  1 sibling, 3 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 18:19 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 19:37 +0200, Jens Axboe wrote:
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > 
> > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > 
> > > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > > > 
> > > > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > > 
> > > > > > It's not _that_ easy, it depends a lot on the access patterns. A 
> > > > > > good example of that is actually the idling that we already do. 
> > > > > > Say you have two applications, each starting up. If you start them 
> > > > > > both at the same time and just care for the dumb low latency, then 
> > > > > > you'll do one IO from each of them in turn. Latency will be good, 
> > > > > > but throughput will be awful. And this means that in 20s they are 
> > > > > > both started, while with the slice idling and priority disk access 
> > > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > > > 
> > > > > > So latency is good, definitely, but sometimes you have to worry 
> > > > > > about the bigger picture too. Latency is more than single IOs, 
> > > > > > it's often for a complete operation which may involve lots of IOs. 
> > > > > > Single IO latency is a benchmark thing, it's not a real life 
> > > > > > issue. And that's where it becomes complex and not so black and 
> > > > > > white. Mike's test is a really good example of that.
> > > > > 
> > > > > To the extent of you arguing that Mike's test is artificial (i'm not 
> > > > > sure you are arguing that) - Mike certainly did not do an artificial 
> > > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > > > 
> > > > [snip]
> > > > 
> > > > I was saying the exact opposite, that Mike's test is a good example of 
> > > > a valid test. It's not measuring single IO latencies, it's doing a 
> > > > sequence of valid events and looking at the latency for those. It's 
> > > > benchmarking the bigger picture, not a microbenchmark.
> > > 
> > > Good, so we are in violent agreement :-)
> > 
> > Yes, perhaps that last sentence didn't provide enough evidence of which
> > category I put Mike's test into :-)
> > 
> > So to kick things off, I added an 'interactive' knob to CFQ and
> > defaulted it to on, along with re-enabling slice idling for hardware
> > that does tagged command queuing. This is almost completely identical to
> > what Vivek Goyal originally posted, it's just combined into one and uses
> > the term 'interactive' instead of 'fairness'. I think the former is a
> > better umbrella under which to add further tweaks that may sacrifice
> > throughput slightly, in the quest for better latency.
> > 
> > It's queued up in the for-linus branch.
> 
> FWIW, I did a matrix of Vivek's patch combined with my hack.  Seems we
> do lose a bit of dd throughput over stock with either or both.
> 
> dd pre         65.1     65.4     67.5     64.8     65.1   65.5     fairness=1 overload_delay=1
> perf stat      1.70     1.94     1.32     1.89     1.87    1.7
> dd post        69.4     62.3     69.7     70.3     69.6   68.2
> 
> dd pre         67.0     67.8     64.7     64.7     64.9   65.8     fairness=1 overload_delay=0
> perf stat      4.89     3.13     2.98     2.71     2.17    3.1
> dd post        67.2     63.3     62.6     62.8     63.1   63.8
> 
> dd pre         65.0     66.0     66.9     64.6     67.0   65.9     fairness=0 overload_delay=1
> perf stat      4.66     3.81     4.23     2.98     4.23    3.9
> dd post        62.0     60.8     62.4     61.4     62.2   61.7
> 
> dd pre         65.3     65.6     64.9     69.5     65.8   66.2     fairness=0 overload_delay=0
> perf stat     14.79     9.11    14.16     8.44    13.67   12.0
> dd post        64.1     66.5     64.0     66.5     64.4   65.1

I'm not too worried about the "single IO producer" scenarios, and it
looks (from a quick look) like most of your numbers are within some
expected noise levels. It's the more complex mixes that are likely to
cause a bit of a stink, but let's worry about that later. One quick thing
would be to read e.g. 2 or more files sequentially from disk and see how
that performs.

If you could do a cleaned up version of your overload patch based on
this:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768

then let's take it from there.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                 ` <20091002173732.GK31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2009-10-02 17:56                                                   ` Ingo Molnar
@ 2009-10-02 18:13                                                   ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 18:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Fri, 2009-10-02 at 19:37 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Ingo Molnar wrote:
> > 
> > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > 
> > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > > 
> > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > > > 
> > > > > It's not _that_ easy, it depends a lot on the access patterns. A 
> > > > > good example of that is actually the idling that we already do. 
> > > > > Say you have two applications, each starting up. If you start them 
> > > > > both at the same time and just care for the dumb low latency, then 
> > > > > you'll do one IO from each of them in turn. Latency will be good, 
> > > > > but throughput will be awful. And this means that in 20s they are 
> > > > > both started, while with the slice idling and priority disk access 
> > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > > 
> > > > > So latency is good, definitely, but sometimes you have to worry 
> > > > > about the bigger picture too. Latency is more than single IOs, 
> > > > > it's often for a complete operation which may involve lots of IOs. 
> > > > > Single IO latency is a benchmark thing, it's not a real life 
> > > > > issue. And that's where it becomes complex and not so black and 
> > > > > white. Mike's test is a really good example of that.
> > > > 
> > > > To the extent of you arguing that Mike's test is artificial (i'm not 
> > > > sure you are arguing that) - Mike certainly did not do an artificial 
> > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > > 
> > > [snip]
> > > 
> > > I was saying the exact opposite, that Mike's test is a good example of 
> > > a valid test. It's not measuring single IO latencies, it's doing a 
> > > sequence of valid events and looking at the latency for those. It's 
> > > benchmarking the bigger picture, not a microbenchmark.
> > 
> > Good, so we are in violent agreement :-)
> 
> Yes, perhaps that last sentence didn't provide enough evidence of which
> category I put Mike's test into :-)
> 
> So to kick things off, I added an 'interactive' knob to CFQ and
> defaulted it to on, along with re-enabling slice idling for hardware
> that does tagged command queuing. This is almost completely identical to
> what Vivek Goyal originally posted, it's just combined into one and uses
> the term 'interactive' instead of 'fairness'. I think the former is a
> better umbrella under which to add further tweaks that may sacrifice
> throughput slightly, in the quest for better latency.
> 
> It's queued up in the for-linus branch.

FWIW, I did a matrix of Vivek's patch combined with my hack.  Seems we
do lose a bit of dd throughput over stock with either or both.

dd pre         65.1     65.4     67.5     64.8     65.1   65.5     fairness=1 overload_delay=1
perf stat      1.70     1.94     1.32     1.89     1.87    1.7
dd post        69.4     62.3     69.7     70.3     69.6   68.2

dd pre         67.0     67.8     64.7     64.7     64.9   65.8     fairness=1 overload_delay=0
perf stat      4.89     3.13     2.98     2.71     2.17    3.1
dd post        67.2     63.3     62.6     62.8     63.1   63.8

dd pre         65.0     66.0     66.9     64.6     67.0   65.9     fairness=0 overload_delay=1
perf stat      4.66     3.81     4.23     2.98     4.23    3.9
dd post        62.0     60.8     62.4     61.4     62.2   61.7

dd pre         65.3     65.6     64.9     69.5     65.8   66.2     fairness=0 overload_delay=0
perf stat     14.79     9.11    14.16     8.44    13.67   12.0
dd post        64.1     66.5     64.0     66.5     64.4   65.1

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02 17:37                                               ` Jens Axboe
       [not found]                                                 ` <20091002173732.GK31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2009-10-02 17:56                                                   ` Ingo Molnar
@ 2009-10-02 18:13                                                 ` Mike Galbraith
  2009-10-02 18:19                                                   ` Jens Axboe
       [not found]                                                   ` <1254507215.8667.7.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2 siblings, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 18:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, 2009-10-02 at 19:37 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Ingo Molnar wrote:
> > 
> > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > > 
> > > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > 
> > > > > It's not _that_ easy, it depends a lot on the access patterns. A 
> > > > > good example of that is actually the idling that we already do. 
> > > > > Say you have two applications, each starting up. If you start them 
> > > > > both at the same time and just care for the dumb low latency, then 
> > > > > you'll do one IO from each of them in turn. Latency will be good, 
> > > > > but throughput will be awful. And this means that in 20s they are 
> > > > > both started, while with the slice idling and priority disk access 
> > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > > 
> > > > > So latency is good, definitely, but sometimes you have to worry 
> > > > > about the bigger picture too. Latency is more than single IOs, 
> > > > > it's often for a complete operation which may involve lots of IOs. 
> > > > > Single IO latency is a benchmark thing, it's not a real life 
> > > > > issue. And that's where it becomes complex and not so black and 
> > > > > white. Mike's test is a really good example of that.
> > > > 
> > > > To the extent of you arguing that Mike's test is artificial (i'm not 
> > > > sure you are arguing that) - Mike certainly did not do an artificial 
> > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > > 
> > > [snip]
> > > 
> > > I was saying the exact opposite, that Mike's test is a good example of 
> > > a valid test. It's not measuring single IO latencies, it's doing a 
> > > sequence of valid events and looking at the latency for those. It's 
> > > benchmarking the bigger picture, not a microbenchmark.
> > 
> > Good, so we are in violent agreement :-)
> 
> Yes, perhaps that last sentence didn't provide enough evidence of which
> category I put Mike's test into :-)
> 
> So to kick things off, I added an 'interactive' knob to CFQ and
> defaulted it to on, along with re-enabling slice idling for hardware
> that does tagged command queuing. This is almost completely identical to
> what Vivek Goyal originally posted, it's just combined into one and uses
> the term 'interactive' instead of 'fairness'. I think the former is a
> better umbrella under which to add further tweaks that may sacrifice
> throughput slightly, in the quest for better latency.
> 
> It's queued up in the for-linus branch.

FWIW, I did a matrix of Vivek's patch combined with my hack.  Seems we
do lose a bit of dd throughput over stock with either or both.

dd pre         65.1     65.4     67.5     64.8     65.1   65.5     fairness=1 overload_delay=1
perf stat      1.70     1.94     1.32     1.89     1.87    1.7
dd post        69.4     62.3     69.7     70.3     69.6   68.2

dd pre         67.0     67.8     64.7     64.7     64.9   65.8     fairness=1 overload_delay=0
perf stat      4.89     3.13     2.98     2.71     2.17    3.1
dd post        67.2     63.3     62.6     62.8     63.1   63.8

dd pre         65.0     66.0     66.9     64.6     67.0   65.9     fairness=0 overload_delay=1
perf stat      4.66     3.81     4.23     2.98     4.23    3.9
dd post        62.0     60.8     62.4     61.4     62.2   61.7

dd pre         65.3     65.6     64.9     69.5     65.8   66.2     fairness=0 overload_delay=0
perf stat     14.79     9.11    14.16     8.44    13.67   12.0
dd post        64.1     66.5     64.0     66.5     64.4   65.1

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                   ` <1254382405.7595.9.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-10-01 18:58                       ` Jens Axboe
@ 2009-10-02 18:08                     ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 18:08 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Oct 01 2009, Mike Galbraith wrote:
>  	max_dispatch = cfqd->cfq_quantum;
>  	if (cfq_class_idle(cfqq))
>  		max_dispatch = 1;
>  
> +	if (cfqd->busy_queues > 1)
> +		cfqd->od_stamp = jiffies;
> +

->busy_queues > 1 just means that they have requests ready for dispatch,
not that they have been dispatched.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-01  7:33                 ` Mike Galbraith
       [not found]                   ` <1254382405.7595.9.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-02 18:08                   ` Jens Axboe
  2009-10-02 18:29                     ` Mike Galbraith
       [not found]                     ` <20091002180857.GM31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  1 sibling, 2 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 18:08 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Thu, Oct 01 2009, Mike Galbraith wrote:
>  	max_dispatch = cfqd->cfq_quantum;
>  	if (cfq_class_idle(cfqq))
>  		max_dispatch = 1;
>  
> +	if (cfqd->busy_queues > 1)
> +		cfqd->od_stamp = jiffies;
> +

->busy_queues > 1 just means that they have requests ready for dispatch,
not that they have been dispatched.


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                   ` <20091002175629.GA14860-X9Un+BFzKDI@public.gmane.org>
@ 2009-10-02 18:04                                                     ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 18:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Fri, Oct 02 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > 
> > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > > 
> > > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > > > 
> > > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > > > > 
> > > > > > It's not _that_ easy, it depends a lot on the access patterns. A 
> > > > > > good example of that is actually the idling that we already do. 
> > > > > > Say you have two applications, each starting up. If you start them 
> > > > > > both at the same time and just care for the dumb low latency, then 
> > > > > > you'll do one IO from each of them in turn. Latency will be good, 
> > > > > > but throughput will be awful. And this means that in 20s they are 
> > > > > > both started, while with the slice idling and priority disk access 
> > > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > > > 
> > > > > > So latency is good, definitely, but sometimes you have to worry 
> > > > > > about the bigger picture too. Latency is more than single IOs, 
> > > > > > it's often for a complete operation which may involve lots of IOs. 
> > > > > > Single IO latency is a benchmark thing, it's not a real life 
> > > > > > issue. And that's where it becomes complex and not so black and 
> > > > > > white. Mike's test is a really good example of that.
> > > > > 
> > > > > To the extent of you arguing that Mike's test is artificial (i'm not 
> > > > > sure you are arguing that) - Mike certainly did not do an artificial 
> > > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > > > 
> > > > [snip]
> > > > 
> > > > I was saying the exact opposite, that Mike's test is a good example of 
> > > > a valid test. It's not measuring single IO latencies, it's doing a 
> > > > sequence of valid events and looking at the latency for those. It's 
> > > > benchmarking the bigger picture, not a microbenchmark.
> > > 
> > > Good, so we are in violent agreement :-)
> > 
> > Yes, perhaps that last sentence didn't provide enough evidence of 
> > which category I put Mike's test into :-)
> > 
> > So to kick things off, I added an 'interactive' knob to CFQ and 
> > defaulted it to on, along with re-enabling slice idling for hardware 
> > that does tagged command queuing. This is almost completely identical 
> > to what Vivek Goyal originally posted, it's just combined into one and 
> > uses the term 'interactive' instead of 'fairness'. I think the former 
> > is a better umbrella under which to add further tweaks that may 
> > sacrifice throughput slightly, in the quest for better latency.
> > 
> > It's queued up in the for-linus branch.
> 
> i'd say 'latency' describes it even better. 'interactivity' as a term is 
> a bit overladen.

I'm not too crazy about it either. How about just using 'desktop' since
this is obviously what we are really targeting? 'latency' isn't fully
descriptive either, since it may not necessarily provide the best single
IO latency (noop would).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02 17:56                                                   ` Ingo Molnar
  (?)
  (?)
@ 2009-10-02 18:04                                                   ` Jens Axboe
       [not found]                                                     ` <20091002180437.GL31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
                                                                       ` (2 more replies)
  -1 siblings, 3 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 18:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, Oct 02 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > 
> > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > 
> > > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > > > 
> > > > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > > 
> > > > > > It's not _that_ easy, it depends a lot on the access patterns. A 
> > > > > > good example of that is actually the idling that we already do. 
> > > > > > Say you have two applications, each starting up. If you start them 
> > > > > > both at the same time and just care for the dumb low latency, then 
> > > > > > you'll do one IO from each of them in turn. Latency will be good, 
> > > > > > but throughput will be awful. And this means that in 20s they are 
> > > > > > both started, while with the slice idling and priority disk access 
> > > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > > > 
> > > > > > So latency is good, definitely, but sometimes you have to worry 
> > > > > > about the bigger picture too. Latency is more than single IOs, 
> > > > > > it's often for a complete operation which may involve lots of IOs. 
> > > > > > Single IO latency is a benchmark thing, it's not a real life 
> > > > > > issue. And that's where it becomes complex and not so black and 
> > > > > > white. Mike's test is a really good example of that.
> > > > > 
> > > > > To the extent of you arguing that Mike's test is artificial (i'm not 
> > > > > sure you are arguing that) - Mike certainly did not do an artificial 
> > > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > > > 
> > > > [snip]
> > > > 
> > > > I was saying the exact opposite, that Mike's test is a good example of 
> > > > a valid test. It's not measuring single IO latencies, it's doing a 
> > > > sequence of valid events and looking at the latency for those. It's 
> > > > benchmarking the bigger picture, not a microbenchmark.
> > > 
> > > Good, so we are in violent agreement :-)
> > 
> > Yes, perhaps that last sentence didn't provide enough evidence of 
> > which category I put Mike's test into :-)
> > 
> > So to kick things off, I added an 'interactive' knob to CFQ and 
> > defaulted it to on, along with re-enabling slice idling for hardware 
> > that does tagged command queuing. This is almost completely identical 
> > to what Vivek Goyal originally posted, it's just combined into one and 
> > uses the term 'interactive' instead of 'fairness'. I think the former 
> > is a better umbrella under which to add further tweaks that may 
> > sacrifice throughput slightly, in the quest for better latency.
> > 
> > It's queued up in the for-linus branch.
> 
> i'd say 'latency' describes it even better. 'interactivity' as a term is 
> a bit overladen.

I'm not too crazy about it either. How about just using 'desktop' since
this is obviously what we are really targeting? 'latency' isn't fully
descriptive either, since it may not necessarily provide the best single
IO latency (noop would).

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                                 ` <20091002173732.GK31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-02 17:56                                                   ` Ingo Molnar
  2009-10-02 18:13                                                   ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02 17:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds


* Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:

> On Fri, Oct 02 2009, Ingo Molnar wrote:
> > 
> > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > 
> > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > > 
> > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > > > 
> > > > > It's not _that_ easy, it depends a lot on the access patterns. A 
> > > > > good example of that is actually the idling that we already do. 
> > > > > Say you have two applications, each starting up. If you start them 
> > > > > both at the same time and just care for the dumb low latency, then 
> > > > > you'll do one IO from each of them in turn. Latency will be good, 
> > > > > but throughput will be awful. And this means that in 20s they are 
> > > > > both started, while with the slice idling and priority disk access 
> > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > > 
> > > > > So latency is good, definitely, but sometimes you have to worry 
> > > > > about the bigger picture too. Latency is more than single IOs, 
> > > > > it's often for a complete operation which may involve lots of IOs. 
> > > > > Single IO latency is a benchmark thing, it's not a real life 
> > > > > issue. And that's where it becomes complex and not so black and 
> > > > > white. Mike's test is a really good example of that.
> > > > 
> > > > To the extent of you arguing that Mike's test is artificial (i'm not 
> > > > sure you are arguing that) - Mike certainly did not do an artificial 
> > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > > 
> > > [snip]
> > > 
> > > I was saying the exact opposite, that Mike's test is a good example of 
> > > a valid test. It's not measuring single IO latencies, it's doing a 
> > > sequence of valid events and looking at the latency for those. It's 
> > > benchmarking the bigger picture, not a microbenchmark.
> > 
> > Good, so we are in violent agreement :-)
> 
> Yes, perhaps that last sentence didn't provide enough evidence of 
> which category I put Mike's test into :-)
> 
> So to kick things off, I added an 'interactive' knob to CFQ and 
> defaulted it to on, along with re-enabling slice idling for hardware 
> that does tagged command queuing. This is almost completely identical 
> to what Vivek Goyal originally posted, it's just combined into one and 
> uses the term 'interactive' instead of 'fairness'. I think the former 
> is a better umbrella under which to add further tweaks that may 
> sacrifice throughput slightly, in the quest for better latency.
> 
> It's queued up in the for-linus branch.

i'd say 'latency' describes it even better. 'interactivity' as a term is 
a bit overladen.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02 17:37                                               ` Jens Axboe
@ 2009-10-02 17:56                                                   ` Ingo Molnar
  2009-10-02 17:56                                                   ` Ingo Molnar
  2009-10-02 18:13                                                 ` Mike Galbraith
  2 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02 17:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel


* Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Oct 02 2009, Ingo Molnar wrote:
> > 
> > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > > 
> > > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > 
> > > > > It's not _that_ easy, it depends a lot on the access patterns. A 
> > > > > good example of that is actually the idling that we already do. 
> > > > > Say you have two applications, each starting up. If you start them 
> > > > > both at the same time and just care for the dumb low latency, then 
> > > > > you'll do one IO from each of them in turn. Latency will be good, 
> > > > > but throughput will be awful. And this means that in 20s they are 
> > > > > both started, while with the slice idling and priority disk access 
> > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > > 
> > > > > So latency is good, definitely, but sometimes you have to worry 
> > > > > about the bigger picture too. Latency is more than single IOs, 
> > > > > it's often for a complete operation which may involve lots of IOs. 
> > > > > Single IO latency is a benchmark thing, it's not a real life 
> > > > > issue. And that's where it becomes complex and not so black and 
> > > > > white. Mike's test is a really good example of that.
> > > > 
> > > > To the extent of you arguing that Mike's test is artificial (i'm not 
> > > > sure you are arguing that) - Mike certainly did not do an artificial 
> > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > > 
> > > [snip]
> > > 
> > > I was saying the exact opposite, that Mike's test is a good example of 
> > > a valid test. It's not measuring single IO latencies, it's doing a 
> > > sequence of valid events and looking at the latency for those. It's 
> > > benchmarking the bigger picture, not a microbenchmark.
> > 
> > Good, so we are in violent agreement :-)
> 
> Yes, perhaps that last sentence didn't provide enough evidence of 
> which category I put Mike's test into :-)
> 
> So to kick things off, I added an 'interactive' knob to CFQ and 
> defaulted it to on, along with re-enabling slice idling for hardware 
> that does tagged command queuing. This is almost completely identical 
> to what Vivek Goyal originally posted, it's just combined into one and 
> uses the term 'interactive' instead of 'fairness'. I think the former 
> is a better umbrella under which to add further tweaks that may 
> sacrifice throughput slightly, in the quest for better latency.
> 
> It's queued up in the for-linus branch.

i'd say 'latency' describes it even better. 'interactivity' as a term is 
a bit overladen.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread
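[Editorial note: Jens's two-application startup example, quoted above, can be sketched with a toy model. The numbers below (per-IO costs, request counts) are illustrative assumptions, not measurements; the point is only that interleaving two sequential streams turns every request into a seek, while slice idling pays the seek cost roughly once per stream.]

```python
# Toy model of the two-application startup example (illustrative
# numbers only): each app issues N sequential reads. Interleaving
# the two streams makes every read pay a seek; slice idling lets
# each stream run sequentially after one initial seek.
SEEK_MS = 8.0   # assumed average seek + rotational latency
SEQ_MS = 0.5    # assumed sequential read service time
N = 1000        # assumed reads per application

def interleaved_completion():
    # Round-robin between the two apps: every request seeks.
    return 2 * N * SEEK_MS

def idled_completion():
    # App 1 runs to completion, then app 2: one seek each,
    # the remaining reads are sequential.
    per_app = SEEK_MS + (N - 1) * SEQ_MS
    return 2 * per_app

print(f"interleaved: {interleaved_completion() / 1000:.1f}s")
print(f"slice idling: {idled_completion() / 1000:.1f}s")
```

With these assumed numbers the gap is about 16s versus 1s, the same order-of-magnitude difference Jens describes (20s vs 2s).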

* Re: IO scheduler based IO controller V10
  2009-10-02 17:28                                               ` Ingo Molnar
  (?)
  (?)
@ 2009-10-02 17:37                                               ` Jens Axboe
       [not found]                                                 ` <20091002173732.GK31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
                                                                   ` (2 more replies)
  -1 siblings, 3 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 17:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, Oct 02 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > 
> > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > 
> > > > It's not _that_ easy, it depends a lot on the access patterns. A 
> > > > good example of that is actually the idling that we already do. 
> > > > Say you have two applications, each starting up. If you start them 
> > > > both at the same time and just care for the dumb low latency, then 
> > > > you'll do one IO from each of them in turn. Latency will be good, 
> > > > but throughput will be awful. And this means that in 20s they are 
> > > > both started, while with the slice idling and priority disk access 
> > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > 
> > > > So latency is good, definitely, but sometimes you have to worry 
> > > > about the bigger picture too. Latency is more than single IOs, 
> > > > it's often for complete operation which may involve lots of IOs. 
> > > > Single IO latency is a benchmark thing, it's not a real life 
> > > > issue. And that's where it becomes complex and not so black and 
> > > > white. Mike's test is a really good example of that.
> > > 
> > > To the extent of you arguing that Mike's test is artificial (i'm not 
> > > sure you are arguing that) - Mike certainly did not do an artificial 
> > > test - he tested 'konsole' cache-cold startup latency, such as:
> > 
> > [snip]
> > 
> > I was saying the exact opposite, that Mike's test is a good example of 
> > a valid test. It's not measuring single IO latencies, it's doing a 
> > sequence of valid events and looking at the latency for those. It's 
> > benchmarking the bigger picture, not a microbenchmark.
> 
> Good, so we are in violent agreement :-)

Yes, perhaps that last sentence didn't provide enough evidence of which
category I put Mike's test into :-)

So to kick things off, I added an 'interactive' knob to CFQ and
defaulted it to on, along with re-enabling slice idling for hardware
that does tagged command queuing. This is almost completely identical to
what Vivek Goyal originally posted, it's just combined into one and uses
the term 'interactive' instead of 'fairness'. I think the former is a
better umbrella under which to add further tweaks that may sacrifice
throughput slightly, in the quest for better latency.

It's queued up in the for-linus branch.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02 17:25                                             ` Jens Axboe
@ 2009-10-02 17:28                                               ` Ingo Molnar
  -1 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02 17:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel


* Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Oct 02 2009, Ingo Molnar wrote:
> > 
> > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > 
> > > It's not _that_ easy, it depends a lot on the access patterns. A 
> > > good example of that is actually the idling that we already do. 
> > > Say you have two applications, each starting up. If you start them 
> > > both at the same time and just care for the dumb low latency, then 
> > > you'll do one IO from each of them in turn. Latency will be good, 
> > > but throughput will be aweful. And this means that in 20s they are 
> > > both started, while with the slice idling and priority disk access 
> > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > 
> > > So latency is good, definitely, but sometimes you have to worry 
> > > about the bigger picture too. Latency is more than single IOs, 
> > > it's often for complete operation which may involve lots of IOs. 
> > > Single IO latency is a benchmark thing, it's not a real life 
> > > issue. And that's where it becomes complex and not so black and 
> > > white. Mike's test is a really good example of that.
> > 
> > To the extent of you arguing that Mike's test is artificial (i'm not 
> > sure you are arguing that) - Mike certainly did not do an artificial 
> > test - he tested 'konsole' cache-cold startup latency, such as:
> 
> [snip]
> 
> I was saying the exact opposite, that Mike's test is a good example of 
> a valid test. It's not measuring single IO latencies, it's doing a 
> sequence of valid events and looking at the latency for those. It's 
> benchmarking the bigger picture, not a microbenchmark.

Good, so we are in violent agreement :-)

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02 17:20                                           ` Ingo Molnar
@ 2009-10-02 17:25                                             ` Jens Axboe
  -1 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 17:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, Oct 02 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > It's not _that_ easy, it depends a lot on the access patterns. A good 
> > example of that is actually the idling that we already do. Say you 
> > have two applications, each starting up. If you start them both at the 
> > same time and just care for the dumb low latency, then you'll do one 
> > IO from each of them in turn. Latency will be good, but throughput 
> > will be awful. And this means that in 20s they are both started, 
> > while with the slice idling and priority disk access that CFQ does, 
> > you'd hopefully have both up and running in 2s.
> > 
> > So latency is good, definitely, but sometimes you have to worry about 
> > the bigger picture too. Latency is more than single IOs, it's often 
> > for complete operation which may involve lots of IOs. Single IO 
> > latency is a benchmark thing, it's not a real life issue. And that's 
> > where it becomes complex and not so black and white. Mike's test is a 
> > really good example of that.
> 
> To the extent of you arguing that Mike's test is artificial (i'm not 
> sure you are arguing that) - Mike certainly did not do an artificial 
> test - he tested 'konsole' cache-cold startup latency, such as:

[snip]

I was saying the exact opposite, that Mike's test is a good example of a
valid test. It's not measuring single IO latencies, it's doing a
sequence of valid events and looking at the latency for those. It's
benchmarking the bigger picture, not a microbenchmark.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02 17:11                                         ` Jens Axboe
@ 2009-10-02 17:20                                           ` Ingo Molnar
  -1 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02 17:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel


* Jens Axboe <jens.axboe@oracle.com> wrote:

> It's not _that_ easy, it depends a lot on the access patterns. A good 
> example of that is actually the idling that we already do. Say you 
> have two applications, each starting up. If you start them both at the 
> same time and just care for the dumb low latency, then you'll do one 
> IO from each of them in turn. Latency will be good, but throughput 
> will be awful. And this means that in 20s they are both started, 
> while with the slice idling and priority disk access that CFQ does, 
> you'd hopefully have both up and running in 2s.
> 
> So latency is good, definitely, but sometimes you have to worry about 
> the bigger picture too. Latency is more than single IOs, it's often 
> for complete operation which may involve lots of IOs. Single IO 
> latency is a benchmark thing, it's not a real life issue. And that's 
> where it becomes complex and not so black and white. Mike's test is a 
> really good example of that.

To the extent of you arguing that Mike's test is artificial (i'm not 
sure you are arguing that) - Mike certainly did not do an artificial 
test - he tested 'konsole' cache-cold startup latency, such as:

    sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE

against a streaming dd.

That is a _very_ relevant benchmark IMHO and konsole's cache footprint 
is far from trivial. (In fact i'd argue it's one of the most important 
IO benchmarks on a desktop system - how does your desktop hold up to 
something doing streaming IO.)

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                       ` <2c0942db0910020933l6d312c6ahae0e00619f598b39-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-10-02 17:13                                         ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 17:13 UTC (permalink / raw)
  To: Ray Lee
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Fri, Oct 02 2009, Ray Lee wrote:
> On Fri, Oct 2, 2009 at 7:56 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > In some cases I wish we had a server vs desktop switch, since it would
> > make decisions on this easier. I know you say that servers care about
> > latency, but not at all to the extent that desktops do. Most desktop
> > users would gladly give away the top of the performance for latency,
> > that's not true of most server users. Depends on what the server does,
> > of course.
> 
> If most of the I/O on a system exhibits seeky tendencies, couldn't the
> schedulers notice that and use that as the hint for what to optimize?
> 
> I mean, there's no switch better than the actual I/O behavior itself.

Heuristics like that have a tendency to fail. What's the cut-off point?
Additionally, heuristics based on past process/system behaviour also have
a tendency to be suboptimal, since things aren't static.

We already look at seekiness of individual processes or groups. IIRC,
as-iosched also keeps a per-queue tracking.

-- 
Jens Axboe

* Re: IO scheduler based IO controller V10
       [not found]                                       ` <alpine.LFD.2.01.0910020811490.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2009-10-02 16:01                                         ` jim owens
@ 2009-10-02 17:11                                         ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 17:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, Oct 02 2009, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Oct 2009, Jens Axboe wrote:
> > 
> > Mostly they care about throughput, and when they come running because
> > some of their favorite app/benchmark/etc is now 2% slower, I get to hear
> > about it all the time. So yes, latency is not ignored, but mostly they
> > yack about throughput.
> 
> The reason they yack about it is that they can measure it.
> 
> Give them the benchmark where it goes the other way, and tell them why 
> they see a 2% deprovement. Give them some button they can tweak, because 
> they will.

To some extent that's true, and I didn't want to generalize. If they are
adamant that the benchmark models their real life, then no amount of
pointing in the other direction will change that.

Your point about tuning is definitely true, these people are used to
tuning things. For the desktop we care a lot more about working out of
the box.

> But make the default be low-latency. Because everybody cares about low 
> latency, and the people who do so are _not_ the people who you give 
> buttons to tweak things with.

Totally agree.

> > I agree, we can easily make CFQ be very much about latency. If you
> > think that is fine, then lets just do that. Then we'll get to fix the
> > server side up when the next RHEL/SLES/whatever cycle is homing in on a
> > kernel, hopefully we won't have to start over when that happens.
> 
> I really think we should do latency first, and throughput second.
> 
> It's _easy_ to get throughput. The people who care just about throughput 
> can always just disable all the work we do for latency. If they really 
> care about just throughput, they won't want fairness either - none of that 
> complex stuff.

It's not _that_ easy, it depends a lot on the access patterns. A good
example of that is actually the idling that we already do. Say you have
two applications, each starting up. If you start them both at the same
time and just care for the dumb low latency, then you'll do one IO from
each of them in turn. Latency will be good, but throughput will be
awful. And this means that in 20s they are both started, while with the
slice idling and priority disk access that CFQ does, you'd hopefully
have both up and running in 2s.

So latency is good, definitely, but sometimes you have to worry about
the bigger picture too. Latency is more than single IOs, it's often for
a complete operation which may involve lots of IOs. Single IO latency is
a benchmark thing, it's not a real life issue. And that's where it
becomes complex and not so black and white. Mike's test is a really good
example of that.

-- 
Jens Axboe

* Re: IO scheduler based IO controller V10
       [not found]                               ` <1254476214.11022.8.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-02 16:37                                 ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02 16:37 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


* Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org> wrote:

> On Fri, 2009-10-02 at 11:24 +0200, Ingo Molnar wrote:
> > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> > 
> > > It's not hard to make the latency good, the hard bit is making sure we 
> > > also perform well for all other scenarios.
> > 
> > Looking at the numbers from Mike:
> > 
> >  | dd competing against perf stat -- konsole -e exec timings, 5 back to 
> >  | back runs
> >  |                                                         Avg
> >  | before         9.15    14.51     9.39    15.06     9.90   11.6
> >  | after [+patch] 1.76     1.54     1.93     1.88     1.56    1.7
> > 
> > _PLEASE_ make read latencies this good - the numbers are _vastly_ 
> > better. We'll worry about the 'other' things _after_ we've reached good 
> > latencies.
> > 
> > I thought this principle was a well established basic rule of Linux 
> > IO scheduling. Why do we have to have a 'latency vs. bandwidth' 
> > discussion again and again? I thought latency won hands down.
> 
> Just a note: In the testing I've done so far, we're better off today 
> than ever, [...]

Definitely so, and a couple of months ago i've sung praises of that 
progress on the IO/fs latencies front:

   http://lkml.org/lkml/2009/4/9/461

... but we are greedy bastards and don't define excellence by how far 
down we have come from but by how high we can still climb ;-)

	Ingo

* Re: IO scheduler based IO controller V10
       [not found]                                     ` <20091002145610.GD31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2009-10-02 15:14                                       ` Linus Torvalds
@ 2009-10-02 16:33                                       ` Ray Lee
  1 sibling, 0 replies; 349+ messages in thread
From: Ray Lee @ 2009-10-02 16:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Fri, Oct 2, 2009 at 7:56 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> In some cases I wish we had a server vs desktop switch, since it would
> make decisions on this easier. I know you say that servers care about
> latency, but not at all to the extent that desktops do. Most desktop
> users would gladly give away the top of the performance for latency,
> that's not true of most server users. Depends on what the server does,
> of course.

If most of the I/O on a system exhibits seeky tendencies, couldn't the
schedulers notice that and use that as the hint for what to optimize?

I mean, there's no switch better than the actual I/O behavior itself.

* Re: IO scheduler based IO controller V10
       [not found]                                   ` <alpine.LFD.2.01.0910020715160.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2009-10-02 14:45                                     ` Mike Galbraith
  2009-10-02 14:56                                     ` Jens Axboe
@ 2009-10-02 16:22                                     ` Ingo Molnar
  2 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02 16:22 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w


* Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:

> On Fri, 2 Oct 2009, Jens Axboe wrote:
> > 
> > It's really not that simple, if we go and do easy latency bits, then 
> > throughput drops 30% or more.
> 
> Well, if we're talking 500-950% improvement vs 30% deprovement, I 
> think it's pretty clear, though. Even the server people do care about 
> latencies.
> 
> Often they care quite a bit, in fact.

The other thing is that latency is basically a given property in any 
system - as an app writer you have to live with it, there's not much you 
can do to improve it.

Bandwidth on the other hand is a lot more engineerable, as it tends to 
be about batching things and you can batch in user-space too. Batching 
is often easier to do than getting good latencies.

Then there's also the fact that the range of apps that care about 
bandwidth is a lot smaller than the range of apps which care about 
latencies. The default should help more apps - i.e. latencies.

	Ingo

* Re: IO scheduler based IO controller V10
       [not found]                                       ` <alpine.LFD.2.01.0910020811490.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2009-10-02 16:01                                         ` jim owens
  2009-10-02 17:11                                         ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: jim owens @ 2009-10-02 16:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

Linus Torvalds wrote:
> 
> I really think we should do latency first, and throughput second.

Agree.

> It's _easy_ to get throughput. The people who care just about throughput 
> can always just disable all the work we do for latency.

But in my experience it is not that simple...

The argument latency vs throughput or desktop vs server is wrong.

I/O can never keep up with the ability of CPUs to dirty data.

On desktops and servers (really many-user-desktops) we want
minimum latency but the enemy is dirty VM.  If we ignore the
need for throughput to flush dirty pages, VM gets angry and
forced VM page cleaning I/O is bad I/O.

We want min latency with low dirty page percent but need to
switch to max write throughput at some high dirty page percent.

We cannot prevent the cliff we fall off where the system
chokes because the dirty page load is too high, but if we
only worry about latency, we bring that choke point cliff in
so it happens with a lower load.  A 10% lower overload point
might be fine to get 100% better latency, but would desktop
users accept a 50% lower overload point where running one
more application makes the system appear hung?

Even desktop users commonly measure "how much work can I do
before the system becomes unresponsive".

jim

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                     ` <20091002145610.GD31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-02 15:14                                       ` Linus Torvalds
  2009-10-02 16:33                                       ` Ray Lee
  1 sibling, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2009-10-02 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w



On Fri, 2 Oct 2009, Jens Axboe wrote:
> 
> Mostly they care about throughput, and when they come running because
> some of their favorite apps/benchmarks/etc. are now 2% slower, I get to
> hear about it all the time. So yes, latency is not ignored, but mostly
> they yack about throughput.

The reason they yack about it is that they can measure it.

Give them the benchmark where it goes the other way, and tell them why 
they see a 2% deprovement. Give them some button they can tweak, because 
they will.

But make the default be low-latency. Because everybody cares about low 
latency, and the people who do so are _not_ the people who you give 
buttons to tweak things with.

> I agree, we can easily make CFQ be very much about latency. If you
> think that is fine, then let's just do that. Then we'll get to fix the
> server side up when the next RHEL/SLES/whatever cycle is homing in on a
> kernel; hopefully we won't have to start over when that happens.

I really think we should do latency first, and throughput second.

It's _easy_ to get throughput. The people who care just about throughput 
can always just disable all the work we do for latency. If they really 
care about just throughput, they won't want fairness either - none of that 
complex stuff.

			Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                     ` <1254494742.7307.37.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-02 14:57                                       ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 14:57 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 07:24 -0700, Linus Torvalds wrote:
> > 
> > On Fri, 2 Oct 2009, Jens Axboe wrote:
> > > 
> > > It's really not that simple, if we go and do easy latency bits, then
> > > throughput drops 30% or more.
> > 
> > Well, if we're talking 500-950% improvement vs 30% deprovement, I think 
> > it's pretty clear, though. Even the server people do care about latencies. 
> > 
> > Often they care quite a bit, in fact.
> > 
> > And Mike's patch didn't look big or complicated.
> 
> But it is a hack.  (thought about and measured, but hack nonetheless)
> 
> I haven't tested it on much other than reader vs streaming writer.  It
> may well destroy the rest of the IO universe. I don't have the hw to
> even test any hairy chested IO.

I'll get a desktop box going on this too. The plan is to make the
latency as good as we can without making too many stupid decisions in
the io scheduler, then we can care about the throughput later. Rinse
and repeat.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                   ` <alpine.LFD.2.01.0910020715160.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2009-10-02 14:45                                     ` Mike Galbraith
@ 2009-10-02 14:56                                     ` Jens Axboe
  2009-10-02 16:22                                     ` Ingo Molnar
  2 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 14:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, Oct 02 2009, Linus Torvalds wrote:
> 
> 
> On Fri, 2 Oct 2009, Jens Axboe wrote:
> > 
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more.
> 
> Well, if we're talking 500-950% improvement vs 30% deprovement, I think 
> it's pretty clear, though. Even the server people do care about latencies. 
> 
> Often they care quite a bit, in fact.

Mostly they care about throughput, and when they come running because
some of their favorite apps/benchmarks/etc. are now 2% slower, I get to
hear about it all the time. So yes, latency is not ignored, but mostly
they yack about throughput.

> And Mike's patch didn't look big or complicated. 

It wasn't, it was more of a hack than something mergeable though (and I
think Mike will agree on that). So I'll repeat what I said to Mike, I'm
very well prepared to get something worked out and merged and I very
much appreciate the work he's putting into this.

> > You can't say it's black and white latency vs throughput issue,
> 
> Umm. Almost 1000% vs 30%. Forget latency vs throughput. That's pretty damn 
> black-and-white _regardless_ of what you're measuring. Plus you probably 
> made up the 30% - have you tested the patch?

The 30% is totally made up, it's based on previous latency vs throughput
tradeoffs. I haven't tested Mike's patch.

> And quite frankly, we get a _lot_ of complaints about latency. A LOT. It's 
> just harder to measure, so people seldom attach numbers to it. But that 
> again means that when people _are_ able to attach numbers to it, we should 
> take those numbers _more_ seriously rather than less.

I agree, we can easily make CFQ be very much about latency. If you
think that is fine, then let's just do that. Then we'll get to fix the
server side up when the next RHEL/SLES/whatever cycle is homing in on a
kernel; hopefully we won't have to start over when that happens.

> So the 30% you threw out as a number is pretty much worthless. 

It's hand waving, definitely. But I've been doing io scheduler tweaking
for years, and I know how hard it is to balance. If you want latency,
then you basically only ever give the device 1 thing to do. And you let
things cool down before switching over. If you do that, then your nice
big array of SSDs or rotating drives will easily drop to 1/4th of the
original performance. So we try and tweak the logic to make everybody
happy.
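
The extreme end of that tradeoff, "only ever give the device 1 thing
to do", could be caricatured as below.  All names are invented; this
is not CFQ code, just an illustration of why pure latency mode kills
throughput on NCQ devices and arrays, which want many requests in
flight:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy scheduler state: requests queued vs. handed to the device. */
struct toy_queue {
	int queued;	/* requests waiting in the scheduler */
	int in_flight;	/* requests the device is working on */
};

/*
 * Pure-latency policy: refuse to dispatch while anything is in
 * flight, so each request sees an otherwise idle device.  A
 * throughput-oriented policy would instead keep the device's queue
 * as full as possible.
 */
static bool may_dispatch_low_latency(const struct toy_queue *q)
{
	return q->queued > 0 && q->in_flight == 0;
}
```

With this rule an SSD array that could service dozens of requests in
parallel is driven one request at a time, which is the "drop to 1/4th
of the original performance" effect described above.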

In some cases I wish we had a server vs desktop switch, since it would
make decisions on this easier. I know you say that servers care about
latency, but not at all to the extent that desktops do. Most desktop
users would gladly give away the top end of performance for latency;
that's not true of most server users. Depends on what the server does,
of course.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                                   ` <alpine.LFD.2.01.0910020715160.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2009-10-02 14:45                                     ` Mike Galbraith
  2009-10-02 14:56                                     ` Jens Axboe
  2009-10-02 16:22                                     ` Ingo Molnar
  2 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 14:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w

On Fri, 2009-10-02 at 07:24 -0700, Linus Torvalds wrote:
> 
> On Fri, 2 Oct 2009, Jens Axboe wrote:
> > 
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more.
> 
> Well, if we're talking 500-950% improvement vs 30% deprovement, I think 
> it's pretty clear, though. Even the server people do care about latencies. 
> 
> Often they care quite a bit, in fact.
> 
> And Mike's patch didn't look big or complicated.

But it is a hack.  (thought about and measured, but hack nonetheless)

I haven't tested it on much other than reader vs streaming writer.  It
may well destroy the rest of the IO universe. I don't have the hw to
even test any hairy chested IO.

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02  9:28                               ` Jens Axboe
@ 2009-10-02 14:24                                   ` Linus Torvalds
  -1 siblings, 0 replies; 349+ messages in thread
From: Linus Torvalds @ 2009-10-02 14:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w



On Fri, 2 Oct 2009, Jens Axboe wrote:
> 
> It's really not that simple, if we go and do easy latency bits, then
> throughput drops 30% or more.

Well, if we're talking 500-950% improvement vs 30% deprovement, I think 
it's pretty clear, though. Even the server people do care about latencies. 

Often they care quite a bit, in fact.

And Mike's patch didn't look big or complicated. 

> You can't say it's black and white latency vs throughput issue,

Umm. Almost 1000% vs 30%. Forget latency vs throughput. That's pretty damn 
black-and-white _regardless_ of what you're measuring. Plus you probably 
made up the 30% - have you tested the patch?

And quite frankly, we get a _lot_ of complaints about latency. A LOT. It's 
just harder to measure, so people seldom attach numbers to it. But that 
again means that when people _are_ able to attach numbers to it, we should 
take those numbers _more_ seriously rather than less.

So the 30% you threw out as a number is pretty much worthless. 

		Linus

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                               ` <20091002095555.GB26962-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-02 12:22                                 ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 12:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, 2009-10-02 at 11:55 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Mike Galbraith wrote:
> >
> >         /*
> >          * Drain async requests before we start sync IO
> >          */
> >         if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> > 
> > Looked about the same to me as..
> >  
> > 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
> > 
> > ..where Vivek prevented turning 1 into 0, so I stamped it ;-)
> 
> cfq_cfqq_idle_window(cfqq) just tells you whether this queue may enter
> idling, not that it is currently idling. The actual idling happens from
> cfq_completed_request(), here:
> 
>                 else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
>                          sync && !rq_noidle(rq))
>                         cfq_arm_slice_timer(cfqd);
> 
> and after that the queue will be marked as waiting, so
> cfq_cfqq_wait_request(cfqq) is a better indication of whether we are
> currently waiting for a request (idling) or not.

Hm.  Then cfq_cfqq_idle_window(cfqq) actually suits my intent better.

(If I want to reduce async's advantage, I should target specifically, ie
only stamp if this queue is a sync queue....otoh, if this queue is sync,
it is now officially too late, whereas if this queue is dd about to
inflict the wrath of kjournald on my reader's world, stamping now is a
really good idea.. scritch scritch scritch <smoke>)

I'll go tinker with it.  Thanks for the clue.

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02  9:55                             ` Jens Axboe
@ 2009-10-02 12:22                               ` Mike Galbraith
       [not found]                               ` <20091002095555.GB26962-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 12:22 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Fri, 2009-10-02 at 11:55 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Mike Galbraith wrote:
> >
> >         /*
> >          * Drain async requests before we start sync IO
> >          */
> >         if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> > 
> > Looked about the same to me as..
> >  
> > 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
> > 
> > ..where Vivek prevented turning 1 into 0, so I stamped it ;-)
> 
> cfq_cfqq_idle_window(cfqq) just tells you whether this queue may enter
> idling, not that it is currently idling. The actual idling happens from
> cfq_completed_request(), here:
> 
>                 else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
>                          sync && !rq_noidle(rq))
>                         cfq_arm_slice_timer(cfqd);
> 
> and after that the queue will be marked as waiting, so
> cfq_cfqq_wait_request(cfqq) is a better indication of whether we are
> currently waiting for a request (idling) or not.

Hm.  Then cfq_cfqq_idle_window(cfqq) actually suits my intent better.

(If I want to reduce async's advantage, I should target specifically, ie
only stamp if this queue is a sync queue....otoh, if this queue is sync,
it is now officially too late, whereas if this queue is dd about to
inflict the wrath of kjournald on my reader's world, stamping now is a
really good idea.. scritch scritch scritch <smoke>)

I'll go tinker with it.  Thanks for the clue.

	-Mike


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-10-02 10:55 Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-02 10:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Jens,
On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> On Fri, Oct 02 2009, Ingo Molnar wrote:
>>
>> * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
>>
>
> It's really not that simple, if we go and do easy latency bits, then
> throughput drops 30% or more. You can't say it's black and white latency
> vs throughput issue, that's just not how the real world works. The
> server folks would be most unpleased.
Could we be more selective when the latency optimization is introduced?

The code that is currently touched by Vivek's patch is:
        if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
            (cfqd->hw_tag && CIC_SEEKY(cic)))
                enable_idle = 0;
basically, when fairness=1, it becomes just:
        if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
                enable_idle = 0;

Note that, even if we enable idling here, the cfq_arm_slice_timer will use
a different idle window for seeky (2ms) than for normal I/O.

I think that the 2ms idle window is good for the single rotational SATA disk scenario,
even if the disk supports NCQ. Realistic access times for those disks are still around
8ms (and proportional to seek length), so waiting 2ms to see if we get a nearby
request may pay off, not only in latency and fairness, but also in throughput.

What we don't want to do is to enable idling for NCQ enabled SSDs
(and this is already taken care in cfq_arm_slice_timer) or for hardware RAIDs.
If we agree that hardware RAIDs should be marked as non-rotational, then that
code could become:

        if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
            (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
                enable_idle = 0;
        else if (sample_valid(cic->ttime_samples)) {
                unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;

                if (cic->ttime_mean > idle_time)
                        enable_idle = 0;
                else
                        enable_idle = 1;
        }

Thanks,
Corrado

>
> --
> Jens Axboe
>

-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                             ` <1254473609.6378.24.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-10-02  9:00                               ` Mike Galbraith
@ 2009-10-02  9:55                               ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02  9:55 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 10:04 +0200, Jens Axboe wrote:
> > On Fri, Oct 02 2009, Mike Galbraith wrote:
> 
> > > If we're in the idle window and doing the async drain thing, we're at
> > > the spot where Vivek's patch helps a ton.  Seemed like a great time to
> > > limit the size of any io that may land in front of my sync reader to
> > > plain "you are not alone" quantity.
> > 
> > You can't be in the idle window and doing async drain at the same time,
> > the idle window doesn't start until the sync queue has completed a
> > request. Hence my above rant on device interference.
> 
> I'll take your word for it.
> 
>         /*
>          * Drain async requests before we start sync IO
>          */
>         if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> 
> Looked about the same to me as..
>  
> 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
> 
> ..where Vivek prevented turning 1 into 0, so I stamped it ;-)

cfq_cfqq_idle_window(cfqq) just tells you whether this queue may enter
idling, not that it is currently idling. The actual idling happens from
cfq_completed_request(), here:

                else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
                         sync && !rq_noidle(rq))
                        cfq_arm_slice_timer(cfqd);

and after that the queue will be marked as waiting, so
cfq_cfqq_wait_request(cfqq) is a better indication of whether we are
currently waiting for a request (idling) or not.

> > > Dunno, I was just tossing rocks and sticks at it.
> > > 
> > > I don't really understand the reasoning behind overloading:  I can see
> > > that allows cutting thicker slabs for the disk, but with the streaming
> > > writer vs reader case, seems only the writers can do that.  The reader
> > > is unlikely to be alone isn't it?  Seems to me that either dd, a flusher
> > > thread or kjournald is going to be there with it, which gives dd a huge
> > > advantage.. it has two proxies to help it squabble over disk, konsole
> > > has none.
> > 
> > That is true, async queues have a huge advantage over sync ones. But
> > sync vs async is only part of it, any combination of queued sync, queued
> > sync random etc have different ramifications on behaviour of the
> > individual queue.
> > 
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
> 
> Yeah, that's why I'm trying to be careful about what I say, I know full
> well this ain't easy to get right.  I'm not even thinking of submitting
> anything, it's just diagnostic testing.

It's much appreciated btw, if we can make this better without killing
throughput, then I'm surely interested in picking up your interesting
bits and getting them massaged into something we can include. So don't
be discouraged, I'm just being realistic :-)


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02  8:53                           ` Mike Galbraith
       [not found]                             ` <1254473609.6378.24.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-10-02  9:00                             ` Mike Galbraith
@ 2009-10-02  9:55                             ` Jens Axboe
  2009-10-02 12:22                               ` Mike Galbraith
       [not found]                               ` <20091002095555.GB26962-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2 siblings, 2 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02  9:55 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 10:04 +0200, Jens Axboe wrote:
> > On Fri, Oct 02 2009, Mike Galbraith wrote:
> 
> > > If we're in the idle window and doing the async drain thing, we're at
> > > the spot where Vivek's patch helps a ton.  Seemed like a great time to
> > > limit the size of any io that may land in front of my sync reader to
> > > plain "you are not alone" quantity.
> > 
> > You can't be in the idle window and doing async drain at the same time,
> > the idle window doesn't start until the sync queue has completed a
> > request. Hence my above rant on device interference.
> 
> I'll take your word for it.
> 
>         /*
>          * Drain async requests before we start sync IO
>          */
>         if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> 
> Looked about the same to me as..
>  
> 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
> 
> ..where Vivek prevented turning 1 into 0, so I stamped it ;-)

cfq_cfqq_idle_window(cfqq) just tells you whether this queue may enter
idling, not that it is currently idling. The actual idling happens from
cfq_completed_request(), here:

                else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
                         sync && !rq_noidle(rq))
                        cfq_arm_slice_timer(cfqd);

and after that the queue will be marked as waiting, so
cfq_cfqq_wait_request(cfqq) is a better indication of whether we are
currently waiting for a request (idling) or not.

> > > Dunno, I was just tossing rocks and sticks at it.
> > > 
> > > I don't really understand the reasoning behind overloading:  I can see
> > > that allows cutting thicker slabs for the disk, but with the streaming
> > > writer vs reader case, seems only the writers can do that.  The reader
> > > is unlikely to be alone isn't it?  Seems to me that either dd, a flusher
> > > thread or kjournald is going to be there with it, which gives dd a huge
> > > advantage.. it has two proxies to help it squabble over disk, konsole
> > > has none.
> > 
> > That is true, async queues have a huge advantage over sync ones. But
> > sync vs async is only part of it, any combination of queued sync, queued
> > sync random etc have different ramifications on behaviour of the
> > individual queue.
> > 
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
> 
> Yeah, that's why I'm trying to be careful about what I say, I know full
> well this ain't easy to get right.  I'm not even thinking of submitting
> anything, it's just diagnostic testing.

It's much appreciated btw, if we can make this better without killing
throughput, then I'm surely interested in picking up your interesting
bits and getting them massaged into something we can include. So don't
be discouraged, I'm just being realistic :-)


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                             ` <20091002092409.GA19529-X9Un+BFzKDI@public.gmane.org>
  2009-10-02  9:28                               ` Jens Axboe
@ 2009-10-02  9:36                               ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02  9:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, 2009-10-02 at 11:24 +0200, Ingo Molnar wrote:
> * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > It's not hard to make the latency good, the hard bit is making sure we 
> > also perform well for all other scenarios.
> 
> Looking at the numbers from Mike:
> 
>  | dd competing against perf stat -- konsole -e exec timings, 5 back to 
>  | back runs
>  |                                                         Avg
>  | before         9.15    14.51     9.39    15.06     9.90   11.6
>  | after [+patch] 1.76     1.54     1.93     1.88     1.56    1.7
> 
> _PLEASE_ make read latencies this good - the numbers are _vastly_ 
> better. We'll worry about the 'other' things _after_ we've reached good 
> latencies.
> 
> I thought this principle was a well established basic rule of Linux IO 
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion 
> again and again? I thought latency won hands down.

Just a note:  In the testing I've done so far, we're better off today
than ever, and I can't recall beating on root ever being anything less
than agony for interactivity.  IO seekers look a lot like CPU sleepers
to me.  Looks like both can be as annoying as hell ;-)

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02  9:24                             ` Ingo Molnar
                                               ` (2 preceding siblings ...)
  (?)
@ 2009-10-02  9:36                             ` Mike Galbraith
  2009-10-02 16:37                                 ` Ingo Molnar
       [not found]                               ` <1254476214.11022.8.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  -1 siblings, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02  9:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jens Axboe, Vivek Goyal, Ulrich Lukas, linux-kernel, containers,
	dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, riel

On Fri, 2009-10-02 at 11:24 +0200, Ingo Molnar wrote:
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > It's not hard to make the latency good, the hard bit is making sure we 
> > also perform well for all other scenarios.
> 
> Looking at the numbers from Mike:
> 
>  | dd competing against perf stat -- konsole -e exec timings, 5 back to 
>  | back runs
>  |                                                         Avg
>  | before         9.15    14.51     9.39    15.06     9.90   11.6
>  | after [+patch] 1.76     1.54     1.93     1.88     1.56    1.7
> 
> _PLEASE_ make read latencies this good - the numbers are _vastly_ 
> better. We'll worry about the 'other' things _after_ we've reached good 
> latencies.
> 
> I thought this principle was a well established basic rule of Linux IO 
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion 
> again and again? I thought latency won hands down.

Just a note:  In the testing I've done so far, we're better off today
than ever, and I can't recall beating on root ever being anything less
than agony for interactivity.  IO seekers look a lot like CPU sleepers
to me.  Looks like both can be as annoying as hell ;-)

	-Mike


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                             ` <20091002092409.GA19529-X9Un+BFzKDI@public.gmane.org>
@ 2009-10-02  9:28                               ` Jens Axboe
  2009-10-02  9:36                               ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02  9:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Oct 02 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > It's not hard to make the latency good, the hard bit is making sure we 
> > also perform well for all other scenarios.
> 
> Looking at the numbers from Mike:
> 
>  | dd competing against perf stat -- konsole -e exec timings, 5 back to 
>  | back runs
>  |                                                         Avg
>  | before         9.15    14.51     9.39    15.06     9.90   11.6
>  | after [+patch] 1.76     1.54     1.93     1.88     1.56    1.7
> 
> _PLEASE_ make read latencies this good - the numbers are _vastly_ 
> better. We'll worry about the 'other' things _after_ we've reached good 
> latencies.
> 
> I thought this principle was a well established basic rule of Linux IO 
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion 
> again and again? I thought latency won hands down.

It's really not that simple, if we go and do easy latency bits, then
throughput drops 30% or more. You can't say it's black and white latency
vs throughput issue, that's just not how the real world works. The
server folks would be most unpleased.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02  9:24                             ` Ingo Molnar
@ 2009-10-02  9:28                               ` Jens Axboe
  -1 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02  9:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel,
	containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	riel

On Fri, Oct 02 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > It's not hard to make the latency good, the hard bit is making sure we 
> > also perform well for all other scenarios.
> 
> Looking at the numbers from Mike:
> 
>  | dd competing against perf stat -- konsole -e exec timings, 5 back to 
>  | back runs
>  |                                                         Avg
>  | before         9.15    14.51     9.39    15.06     9.90   11.6
>  | after [+patch] 1.76     1.54     1.93     1.88     1.56    1.7
> 
> _PLEASE_ make read latencies this good - the numbers are _vastly_ 
> better. We'll worry about the 'other' things _after_ we've reached good 
> latencies.
> 
> I thought this principle was a well established basic rule of Linux IO 
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion 
> again and again? I thought latency won hands down.

It's really not that simple, if we go and do easy latency bits, then
throughput drops 30% or more. You can't say it's black and white latency
vs throughput issue, that's just not how the real world works. The
server folks would be most unpleased.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-10-02  9:28                               ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02  9:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, fernando, Ulrich Lukas, mikew, jmoyer, nauman,
	Vivek Goyal, m-ikeda, riel, lizf, fchecconi, containers,
	Mike Galbraith, linux-kernel, akpm, righi.andrea, torvalds

On Fri, Oct 02 2009, Ingo Molnar wrote:
> 
> * Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > It's not hard to make the latency good, the hard bit is making sure we 
> > also perform well for all other scenarios.
> 
> Looking at the numbers from Mike:
> 
>  | dd competing against perf stat -- konsole -e exec timings, 5 back to 
>  | back runs
>  |                                                         Avg
>  | before         9.15    14.51     9.39    15.06     9.90   11.6
>  | after [+patch] 1.76     1.54     1.93     1.88     1.56    1.7
> 
> _PLEASE_ make read latencies this good - the numbers are _vastly_ 
> better. We'll worry about the 'other' things _after_ we've reached good 
> latencies.
> 
> I thought this principle was a well established basic rule of Linux IO 
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion 
> again and again? I thought latency won hands down.

It's really not that simple, if we go and do easy latency bits, then
throughput drops 30% or more. You can't say it's black and white latency
vs throughput issue, that's just not how the real world works. The
server folks would be most unpleased.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                           ` <20091002080417.GG14918-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2009-10-02  8:53                             ` Mike Galbraith
@ 2009-10-02  9:24                             ` Ingo Molnar
  1 sibling, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02  9:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


* Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:

> It's not hard to make the latency good, the hard bit is making sure we 
> also perform well for all other scenarios.

Looking at the numbers from Mike:

 | dd competing against perf stat -- konsole -e exec timings, 5 back to 
 | back runs
 |                                                         Avg
 | before         9.15    14.51     9.39    15.06     9.90   11.6
 | after [+patch] 1.76     1.54     1.93     1.88     1.56    1.7

_PLEASE_ make read latencies this good - the numbers are _vastly_ 
better. We'll worry about the 'other' things _after_ we've reached good 
latencies.

I thought this principle was a well established basic rule of Linux IO 
scheduling. Why do we have to have a 'latency vs. bandwidth' discussion 
again and again? I thought latency won hands down.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-02  8:04                           ` Jens Axboe
@ 2009-10-02  9:24                             ` Ingo Molnar
  -1 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02  9:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel,
	containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	riel


* Jens Axboe <jens.axboe@oracle.com> wrote:

> It's not hard to make the latency good, the hard bit is making sure we 
> also perform well for all other scenarios.

Looking at the numbers from Mike:

 | dd competing against perf stat -- konsole -e exec timings, 5 back to 
 | back runs
 |                                                         Avg
 | before         9.15    14.51     9.39    15.06     9.90   11.6
 | after [+patch] 1.76     1.54     1.93     1.88     1.56    1.7

_PLEASE_ make read latencies this good - the numbers are _vastly_ 
better. We'll worry about the 'other' things _after_ we've reached good 
latencies.

I thought this principle was a well established basic rule of Linux IO 
scheduling. Why do we have to have a 'latency vs. bandwidth' discussion 
again and again? I thought latency won hands down.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-10-02  9:24                             ` Ingo Molnar
  0 siblings, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02  9:24 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval, peterz, dm-devel, dpshah, agk, balbir, paolo.valente,
	jmarchan, fernando, Ulrich Lukas, mikew, jmoyer, nauman,
	Vivek Goyal, m-ikeda, riel, lizf, fchecconi, containers,
	Mike Galbraith, linux-kernel, akpm, righi.andrea, torvalds


* Jens Axboe <jens.axboe@oracle.com> wrote:

> It's not hard to make the latency good, the hard bit is making sure we 
> also perform well for all other scenarios.

Looking at the numbers from Mike:

 | dd competing against perf stat -- konsole -e exec timings, 5 back to 
 | back runs
 |                                                         Avg
 | before         9.15    14.51     9.39    15.06     9.90   11.6
 | after [+patch] 1.76     1.54     1.93     1.88     1.56    1.7

_PLEASE_ make read latencies this good - the numbers are _vastly_ 
better. We'll worry about the 'other' things _after_ we've reached good 
latencies.

I thought this principle was a well established basic rule of Linux IO 
scheduling. Why do we have to have a 'latency vs. bandwidth' discussion 
again and again? I thought latency won hands down.

	Ingo

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                             ` <1254473609.6378.24.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-02  9:00                               ` Mike Galbraith
  2009-10-02  9:55                               ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02  9:00 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


> WRT my who can overload theory, I instrumented for my own edification.
> 
> Overload totally forbidden, stamps ergo disabled.
> 
> fairness=0  11.3 avg  (ie == virgin source)
> fairness=1   2.8 avg

(oops, quantum was set to 16 as well there.  not that it matters, but
for completeness)

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                           ` <20091002080417.GG14918-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-02  8:53                             ` Mike Galbraith
  2009-10-02  9:24                             ` Ingo Molnar
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02  8:53 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, 2009-10-02 at 10:04 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Mike Galbraith wrote:

> > If we're in the idle window and doing the async drain thing, we're at
> > the spot where Vivek's patch helps a ton.  Seemed like a great time to
> > limit the size of any io that may land in front of my sync reader to
> > plain "you are not alone" quantity.
> 
> You can't be in the idle window and doing async drain at the same time,
> the idle window doesn't start until the sync queue has completed a
> request. Hence my above rant on device interference.

I'll take your word for it.

        /*
         * Drain async requests before we start sync IO
         */
        if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])

Looked about the same to me as..
 
	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

..where Vivek prevented turning 1 into 0, so I stamped it ;-)

> > Dunno, I was just tossing rocks and sticks at it.
> > 
> > I don't really understand the reasoning behind overloading:  I can see
> > that allows cutting thicker slabs for the disk, but with the streaming
> > writer vs reader case, seems only the writers can do that.  The reader
> > is unlikely to be alone isn't it?  Seems to me that either dd, a flusher
> > thread or kjournald is going to be there with it, which gives dd a huge
> > advantage.. it has two proxies to help it squabble over disk, konsole
> > has none.
> 
> That is true, async queues have a huge advantage over sync ones. But
> sync vs async is only part of it, any combination of queued sync, queued
> sync random etc have different ramifications on behaviour of the
> individual queue.
> 
> It's not hard to make the latency good, the hard bit is making sure we
> also perform well for all other scenarios.

Yeah, that's why I'm trying to be careful about what I say, I know full
well this ain't easy to get right.  I'm not even thinking of submitting
anything, it's just diagnostic testing.

WRT my who can overload theory, I instrumented for my own edification.

Overload totally forbidden, stamps ergo disabled.

fairness=0  11.3 avg  (ie == virgin source)
fairness=1   2.8 avg

Back to virgin settings, instrument who is overloading during sequences of..
        echo 2 > /proc/sys/vm/drop_caches
        sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE
..with dd continually running.

1 second counts for above.
...
[  916.585880] od_sync: 0  od_async: 87  reject_sync: 0  reject_async: 37
[  917.662585] od_sync: 0  od_async: 126  reject_sync: 0  reject_async: 53
[  918.732872] od_sync: 0  od_async: 96  reject_sync: 0  reject_async: 22
[  919.743730] od_sync: 0  od_async: 75  reject_sync: 0  reject_async: 15
[  920.914549] od_sync: 0  od_async: 81  reject_sync: 0  reject_async: 17
[  921.988198] od_sync: 0  od_async: 123  reject_sync: 0  reject_async: 30
...minutes long

(reject == cfqq->dispatched >= 4 * max_dispatch)

Doing the same with firefox, I did see the burst below one time, dunno
what triggered that.  I watched 6 runs, and only saw such a burst once.
Typically, numbers are the same as konsole, with a very rare 4 or
5 for sync sneaking in.

[ 1988.177758] od_sync: 0  od_async: 104  reject_sync: 0  reject_async: 48
[ 1992.291779] od_sync: 19  od_async: 83  reject_sync: 0  reject_async: 82
[ 1993.300850] od_sync: 79  od_async: 0  reject_sync: 28  reject_async: 0
[ 1994.313327] od_sync: 147  od_async: 104  reject_sync: 90  reject_async: 16
[ 1995.378025] od_sync: 14  od_async: 45  reject_sync: 0  reject_async: 2
[ 1996.456871] od_sync: 15  od_async: 74  reject_sync: 1  reject_async: 7
[ 1997.611226] od_sync: 0  od_async: 84  reject_sync: 0  reject_async: 14

Never noticed a sync overload watching a make -j4 for a couple minutes.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                         ` <1254464628.7158.101.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-02  8:04                           ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02  8:04 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Thu, 2009-10-01 at 20:58 +0200, Jens Axboe wrote:
> > On Thu, Oct 01 2009, Mike Galbraith wrote:
> > > > CIC_SEEK_THR is 8K jiffies so that would be 8 seconds on a 1000HZ system. Try
> > > > using one "slice_idle" period of 8 ms. But it might turn out to be too
> > > > short depending on the disk speed.
> > > 
> > > Yeah, it is too short, as is even _400_ ms.  Trouble is, by the time
> > > some new task is determined to be seeky, the damage is already done.
> > > 
> > > The below does better, though not as well as "just say no to overload"
> > > of course ;-)
> > 
> > So this essentially takes the "avoid impact from previous slice" to a
> > new extreme, by idling even before dispatching requests from the new
> > queue. We basically do two things to prevent this already - one is to
> > only set the slice when the first request is actually serviced, and the
> > other is to drain async requests completely before starting sync ones.
> > I'm a bit surprised that the former doesn't solve the problem fully, I
> > guess what happens is that if the drive has been flooded with writes, it
> > may service the new read immediately and then return to finish emptying
> > its writeback cache. This will cause an impact for any sync IO until
> > that cache is flushed, and then cause that sync queue to not get as much
> > service as it should have.
> 
> I did the stamping selection other than how long have we been solo based
> on these possibly wrong speculations:
> 
> If we're in the idle window and doing the async drain thing, we're at
> the spot where Vivek's patch helps a ton.  Seemed like a great time to
> limit the size of any io that may land in front of my sync reader to
> plain "you are not alone" quantity.

You can't be in the idle window and doing async drain at the same time,
the idle window doesn't start until the sync queue has completed a
request. Hence my above rant on device interference.

> If we've got sync io in flight, that should mean that my new or old
> known seeky queue has been serviced at least once.  There's likely to be
> more on the way, so delay overloading then too. 
> 
> The seeky bit is supposed to be the earlier "last time we saw a seeker"
> thing, but known seeky is too late to help a new task at all unless you
> turn off the overloading for ages, so I added the if incalculable check
> for good measure, hoping that meant the task is new, may want to exec.
> 
> Stamping any place may (see below) possibly limit the size of the io the
> reader can generate as well as writer, but I figured what's good for the
> goose is good for the gander, or it ain't really good.  The overload
> was causing the observed pain, definitely ain't good for both at these
> times at least, so don't let it do that.
> 
> > Perhaps the "set slice on first complete" isn't working correctly? Or
> > perhaps we just need to be more extreme.
> 
> Dunno, I was just tossing rocks and sticks at it.
> 
> I don't really understand the reasoning behind overloading:  I can see
> that allows cutting thicker slabs for the disk, but with the streaming
> writer vs reader case, seems only the writers can do that.  The reader
> is unlikely to be alone isn't it?  Seems to me that either dd, a flusher
> thread or kjournald is going to be there with it, which gives dd a huge
> advantage.. it has two proxies to help it squabble over disk, konsole
> has none.

That is true, async queues have a huge advantage over sync ones. But
sync vs async is only part of it, any combination of queued sync, queued
sync random etc have different ramifications on behaviour of the
individual queue.

It's not hard to make the latency good, the hard bit is making sure we
also perform well for all other scenarios.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                       ` <20091001185816.GU14918-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-10-02  6:23                         ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02  6:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, 2009-10-01 at 20:58 +0200, Jens Axboe wrote:
> On Thu, Oct 01 2009, Mike Galbraith wrote:
> > > CIC_SEEK_THR is 8K jiffies so that would be 8 seconds on a 1000HZ system. Try
> > > using one "slice_idle" period of 8 ms. But it might turn out to be too
> > > short depending on the disk speed.
> > 
> > Yeah, it is too short, as is even _400_ ms.  Trouble is, by the time
> > some new task is determined to be seeky, the damage is already done.
> > 
> > The below does better, though not as well as "just say no to overload"
> > of course ;-)
> 
> So this essentially takes the "avoid impact from previous slice" to a
> new extreme, by idling even before dispatching requests from the new
> queue. We basically do two things to prevent this already - one is to
> only set the slice when the first request is actually serviced, and the
> other is to drain async requests completely before starting sync ones.
> I'm a bit surprised that the former doesn't solve the problem fully, I
> guess what happens is that if the drive has been flooded with writes, it
> may service the new read immediately and then return to finish emptying
> its writeback cache. This will cause an impact for any sync IO until
> that cache is flushed, and then cause that sync queue to not get as much
> service as it should have.

I did the stamping selection other than how long have we been solo based
on these possibly wrong speculations:

If we're in the idle window and doing the async drain thing, we're at
the spot where Vivek's patch helps a ton.  Seemed like a great time to
limit the size of any io that may land in front of my sync reader to
plain "you are not alone" quantity.

If we've got sync io in flight, that should mean that my new or old
known seeky queue has been serviced at least once.  There's likely to be
more on the way, so delay overloading then too. 

The seeky bit is supposed to be the earlier "last time we saw a seeker"
thing, but known seeky is too late to help a new task at all unless you
turn off the overloading for ages, so I added the if incalculable check
for good measure, hoping that meant the task is new, may want to exec.

Stamping any place may (see below) possibly limit the size of the io the
reader can generate as well as writer, but I figured what's good for the
goose is good for the gander, or it ain't really good.  The overload
was causing the observed pain, definitely ain't good for both at these
times at least, so don't let it do that.

> Perhaps the "set slice on first complete" isn't working correctly? Or
> perhaps we just need to be more extreme.

Dunno, I was just tossing rocks and sticks at it.

I don't really understand the reasoning behind overloading:  I can see
that allows cutting thicker slabs for the disk, but with the streaming
writer vs reader case, seems only the writers can do that.  The reader
is unlikely to be alone isn't it?  Seems to me that either dd, a flusher
thread or kjournald is going to be there with it, which gives dd a huge
advantage.. it has two proxies to help it squabble over disk, konsole
has none.

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                 ` <20091001133109.GA4058-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-10-02  2:57                   ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-02  2:57 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Oct 01, 2009 at 09:31:09AM -0400, Vivek Goyal wrote:
> On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> > 
> > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > > > Hi Vivek,
> > > > 
> > > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > > I was thinking that elevator layer will do the merge of bios. So IO
> > > > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > > > into the disk and again timestamp with finish time once request finishes.
> > > > > 
> > > > > This way higher layer can get an idea how much disk time a group of bios
> > > > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > > > then time accounting becomes an issue.
> > > > > 
> > > > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > > > requests finish at time t4, t5, t6 and t7. For the sake of simplicity assume
> > > > > time elapsed between each of milestones is t. Also assume that all these
> > > > > requests are from same queue/group.
> > > > > 
> > > > >         t0   t1   t2   t3  t4   t5   t6   t7
> > > > >         rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
> > > > > 
> > > > > Now higher layer will think that time consumed by group is:
> > > > > 
> > > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > > > 
> > > > > But the time elapsed is only 7t.
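The accounting discrepancy Vivek describes can be sketched numerically (a standalone illustration with the example's hypothetical timestamps, not kernel code):

```python
# Four requests dispatched at t0..t3 and completing at t4..t7, one time
# unit apart, all from the same queue/group.
dispatch = [0, 1, 2, 3]   # t0, t1, t2, t3
finish   = [4, 5, 6, 7]   # t4, t5, t6, t7

# Naive per-request accounting charges each request (finish - dispatch),
# so overlapping service time is counted once per in-flight request.
naive = sum(f - d for d, f in zip(dispatch, finish))

# Accumulating time only while at least one IO is in flight amounts to
# taking the length of the union of the [dispatch, finish) intervals.
busy_until, union = float("-inf"), 0
for d, f in sorted(zip(dispatch, finish)):
    if f > busy_until:
        union += f - max(d, busy_until)
        busy_until = f

print(naive)   # 16 time units attributed to the group
print(union)   # only 7 time units of disk time actually elapsed
```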
> > > > 
> > > > IO controller can know how many requests are issued and still in
> > > > progress. Is it not enough to accumulate the time while in-flight IOs
> > > > exist?
> > > > 
> > > 
> > > That time would not reflect disk time used. It will be the following:
> > > 
> > > (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> > > (time spent in disk)
> > 
> > In the case where multiple IO requests are issued from IO controller,
> > that time measurement is the time from when the first IO request is
> > issued until when the endio is called for the last IO request. Does
> > not it reflect disk time?
> > 
> 
> Not accurately, as it will include the time spent in CFQ queues as
> well as the dispatch queue. I would not worry much about dispatch queue
> time, but time spent in CFQ queues can be significant.
> 
> This is assuming that you are using token based scheme and will be
> dispatching requests from multiple groups at the same time.
> 

Thinking more about it...

Does time based fairness make sense at higher level logical devices?

- Time based fairness generally helps with rotational devices, which have
  high seek costs. At a higher level we don't even know the nature of the
  underlying device where the IO will ultimately go.

- For time based fairness to work accurately at a higher level, it will most
  likely require dispatching from a single group at a time, waiting for
  requests from that group to complete, and then dispatching from the next:
  something like CFQ's model of queues.

  Dispatching from a single queue/group works well in the case of a single
  underlying device where CFQ is operating, but at higher level devices,
  where typically there will be multiple physical devices underneath, it
  might not make sense, as it makes things more linear and reduces
  parallel processing further. So dispatching from a single group at a time
  and waiting before we dispatch from the next group will most likely be a
  killer for throughput on higher level devices.

  If we don't adopt the policy of dispatching from a single group, then we
  run into all the issues of weak isolation between groups, higher
  latencies, preemptions across groups, etc.

The more I think about the whole issue and the desired set of requirements,
the more I am convinced that we probably need two IO controlling mechanisms:
one which focuses purely on providing bandwidth fairness numbers on high
level devices, and the other which works at low level devices with CFQ and
provides good bandwidth shaping, strong isolation, fairness within a
group, and good control over latencies.

The higher level controller will not worry about time based policies. It can
implement max BW and proportional BW control based on the size and number
of IOs.

The lower level controller at the CFQ level will implement time based group
scheduling. Keeping it at a low level will have the advantage of better
utilization of hardware in various dm/md configurations (as no throttling
takes place at a higher level), but at the cost of less strict fairness
numbers at the higher level. So those who want strict fairness policies at
higher level devices, irrespective of the shortcomings, can use the former;
others can stick to the lower level controller.

For buffered write control we anyway have to either do something in the
memory controller or come up with another cgroup controller which throttles
IO before it goes into the cache. Or, in fact, we can have a re-look at
Andrea Righi's controller, which provided max BW and throttled buffered
writes before they got into the page cache, and try to provide proportional
BW there as well.
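The max-BW idea above can be sketched as a simple token bucket that blocks a writer before its data enters the cache. This is a userspace illustration under assumed names and rates, not Andrea Righi's actual implementation:

```python
import time

class MaxBwThrottle:
    """Hypothetical token-bucket sketch of per-group max-BW control,
    applied before a buffered write enters the page cache."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = float(burst_bytes)   # start with a full bucket
        self.last = time.monotonic()

    def admit(self, nbytes):
        """Block the caller until nbytes of bandwidth budget is available."""
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes > self.tokens:
            # Sleep until the deficit has been refilled, then charge it all.
            time.sleep((nbytes - self.tokens) / self.rate)
            self.last = time.monotonic()   # refill during the sleep was consumed
            self.tokens = 0.0
        else:
            self.tokens -= nbytes

bucket = MaxBwThrottle(rate_bytes_per_sec=1 << 20, burst_bytes=1 << 20)
bucket.admit(512 * 1024)   # fits within the burst, returns immediately
```

A proportional-BW variant would hand out refill tokens to groups in proportion to their weights rather than at a fixed per-group rate.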

Basically I see the space for two IO controllers. At the moment I can't
think of a way to come up with a single controller which satisfies all
the requirements, so instead provide two and let the user choose one based
on his need.

Any thoughts?

Before finishing this mail, I will throw a wacky idea into the ring. I was
going through the request-based dm-multipath paper. Will it make sense
to implement request based dm-ioband? So basically we implement all the
group scheduling in CFQ and let dm-ioband implement a request function
to take the request and break it back into bios. This way we can keep
all the group control at one place and also meet most of the requirements.

So request based dm-ioband will have a request in hand once that request
has passed group control and prio control. Because dm-ioband is a device
mapper target, one can put it on higher level devices (practically taking
CFQ to the higher level device) and provide fairness there. One can also
put it on those SSDs which don't use an IO scheduler (this is kind of
forcing them to use one).

I am sure there will be many issues, but one big issue I can think of is
that CFQ assumes there is one device beneath it and dispatches requests
from one queue (in the case of idling); that would kill parallelism at the
higher layer, and throughput would suffer on many dm/md configurations.

Thanks
Vivek

> But if you figure out a way that you dispatch requests from one group only
> at one time and wait for all requests to finish and then let next group
> go, then above can work fairly accurately. In that case it will become
> like CFQ with the only difference that effectively we have one queue per
> group instead of per process.
> 
> > > > > Secondly if a different group is running only a single sequential reader,
> > > > > there CFQ will be driving a queue depth of 1 and time will not be running
> > > > > faster, and this inaccuracy in accounting will lead to unfair share between
> > > > > groups.
> > > > >
> > > > > So we need something better to get a sense which group used how much of
> > > > > disk time.
> > > > 
> > > > It could be solved by implementing the way to pass on such information
> > > > from IO scheduler to higher layer controller.
> > > > 
> > > 
> > > How would you do that? Can you give some details exactly how and what
> > > information IO scheduler will pass to higher level IO controller so that IO
> > > controller can attribute right time to the group.
> > 
> > If you would like to know when the idle timer is expired, how about
> > adding a function to IO controller to be notified it from IO
> > scheduler? IO scheduler calls the function when the timer is expired.
> > 
> 
> This probably can be done. So this is like syncing between lower layers
> and higher layers about when do we start idling and when do we stop it and
> both the layers should be in sync.
> 
> This is something my common layer approach does. Because it is so close to
> the IO scheduler, I can do it relatively easily.
> 
> One probably can create interfaces to even propagate this information up.
> But this all will probably come into the picture only if we don't use
> token based schemes and come up with something where at one point of time
> dispatch are from one group only.
> 
> > > > > > How about making throttling policy be user selectable like the IO
> > > > > > scheduler and putting it in the higher layer? So we could support
> > > > > > all of policies (time-based, size-based and rate limiting). There
> > > > > > seems not to only one solution which satisfies all users. But I agree
> > > > > > with starting with proportional bandwidth control first. 
> > > > > > 
> > > > > 
> > > > > What are the cases where time based policy does not work and size based
> > > > > policy works better and user would choose size based policy and not timed
> > > > > based one?
> > > > 
> > > > I think that disk time is not simply proportional to IO size. If there
> > > > are two groups whose weights are equally assigned and they issue
> > > > different sized IOs respectively, the bandwidth of each group would
> > > > not be distributed equally as expected.
> > > > 
> > > 
> > > If we are providing fairness in terms of time, it is fair. If we provide
> > > equal time slots to two processes and if one got more IO done because it
> > > was not wasting time seeking or it issued bigger size IO, it deserves that
> > > higher BW. IO controller will make sure that process gets fair share in
> > > terms of time and exactly how much BW one got will depend on the workload.
> > > 
> > > That's the precise reason that fairness in terms of time is better on
> > > seeky media.
> > 
> > If the seek time is negligible, the bandwidth would not be distributed 
> > according to a proportion of weight settings. I think that it would be
> > unclear for users to understand how bandwidth is distributed. And I
> > also think that seeky media will gradually become obsolete.
> > 
> 
> I can understand that the lower the seek cost, the more the game starts
> changing, and probably a size based policy also works decently.
> 
> In that case at some point of time probably CFQ will also need to support
> another mode/policy where fairness is provided in terms of size of IO, if
> it detects an SSD with hardware queuing. Currently it seems to be disabling
> the idling in that case. But this is not very good from fairness point of
> view. I guess if CFQ wants to provide fairness in such cases, it needs to
> dynamically change the shape and start thinking in terms of size of IO.
> 
> So far my testing has been very limited to hard disks connected to my
> computer. I will do some testing on high end enterprise storage and see
> how much do seek matter and how well both the implementations work.
> 
> > > > > I am not against implementing things in higher layer as long as we can
> > > > > ensure tight control on latencies, strong isolation between groups and
> > > > > not break CFQ's class and ioprio model with-in group.
> > > > > 
> > > > > > BTW, I will start to reimplement dm-ioband into block layer.
> > > > > 
> > > > > Can you elaborate little bit on this?
> > > > 
> > > > bio is grabbed in generic_make_request() and throttled by the same
> > > > mechanism as dm-ioband's. The dmsetup command is not necessary any longer.
> > > > 
> > > 
> > > Ok, so one would not need dm-ioband device now, but same dm-ioband
> > > throttling policies will apply. So until and unless we figure out a
> > > better way, the issues I have pointed out will still exist even in
> > > the new implementation.
> > 
> > Yes, those still exist, but somehow I would like to try to solve them.
> > 
> > > > The default value of io_limit on the previous test was 128 (not 192)
> > > > which is equal to the default value of nr_request.
> > > 
> > > Hm..., I used the following commands to create two ioband devices.
> > > 
> > > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> > > "weight 0 :100" | dmsetup create ioband1
> > > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> > > "weight 0 :100" | dmsetup create ioband2
> > > 
> > > Here the io_limit value is zero, so it should pick the default value. The
> > > following is the output of the "dmsetup table" command.
> > > 
> > > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> > > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
> > >                                     ^^^^
> > > IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> > > to be 192?
> > 
> > The default value has changed since v1.12.0, increasing from 128 to 192.
> > 
> > > > > I set it up to 256 as you suggested. I still see the writer starving the
> > > > > reader. I have removed "conv=fdatasync" from the writer so that the writer
> > > > > does pure buffered writes.
> > > > 
> > > > O.K. You removed "conv=fdatasync", the new dm-ioband handles
> > > > sync/async requests separately, and it solves this
> > > > buffered-write-starves-read problem. I would like to post it soon
> > > > after doing some more tests.
> > > > 
> > > > > On top of that can you please give some details how increasing the
> > > > > buffered queue length reduces the impact of writers?
> > > > 
> > > > When the number of in-flight IOs exceeds io_limit, processes which are
> > > > going to issue IOs are made to sleep by dm-ioband until all the in-flight
> > > > IOs are finished. But the IO scheduler layer can accept more IO requests
> > > > than the value of io_limit, so it was a throughput bottleneck.
> > > > 
> > > 
> > > Ok, so it should have been a throughput bottleneck, but how did it solve
> > > the issue of the writer starving the reader, as you had mentioned in the
> > > mail?
> > 
> > As I wrote above, I modified dm-ioband to handle sync/async requests
> > separately, so even if writers do a lot of buffered IOs, readers can
> > issue IOs regardless of the writers' busyness. Once the IOs are
> > backlogged for throttling, both sync and async requests are issued
> > according to the order of arrival.
> > 
> 
> Ok, so if both the readers and writers are buffered and some tokens become
> available, then these tokens will be divided half and half between the
> reader and writer queues?
> 
> > > Secondly, you mentioned that processes are made to sleep once we cross
> > > io_limit. This sounds like the request descriptor facility on the request
> > > queue, where processes are made to sleep.
> > >
> > > There are threads in the kernel which don't want to sleep while submitting
> > > bios. For example, btrfs has a bio submitting thread which does not want
> > > to sleep, hence it checks with the device whether it is congested and does
> > > not submit the bio if it is. How would you handle such cases? Have you
> > > implemented any per-group congestion kind of interface to make sure such
> > > IOs don't sleep if the group is congested?
> > >
> > > Or is this limit per ioband device, which every group on the device is
> > > sharing? If yes, then how would you provide isolation between groups?
> > > Because if one group consumes io_limit tokens, then the others will simply
> > > be serialized on that device.
> > 
> > There are two kinds of limits, and both limit the number of IO requests
> > which can be issued simultaneously, but one is per ioband device and
> > the other is per ioband group. The per group limit assigned to
> > each group is calculated by dividing io_limit according to their
> > proportion of weight.
> > 
> > The kernel thread is not made to sleep by the per group limit, because
> > several kinds of kernel threads submit IOs from multiple groups and
> > for multiple devices in a single thread. At this time, the kernel
> > thread is made to sleep by the per device limit only.
> > 
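The weight-proportional split Ryo describes can be sketched as follows (function name and weights are illustrative assumptions, not dm-ioband's actual code):

```python
def per_group_limits(io_limit, weights):
    """Hypothetical sketch: split a device-wide io_limit among ioband
    groups in proportion to their configured weights."""
    total = sum(weights.values())
    # Integer division, but never starve a group entirely.
    return {grp: max(1, io_limit * w // total) for grp, w in weights.items()}

# Illustrative weights with the default io_limit of 192:
limits = per_group_limits(192, {"ioband1": 768, "ioband2": 256})
print(limits)   # {'ioband1': 144, 'ioband2': 48}
```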
> 
> Interesting. Actually, not blocking kernel threads on the per group limit
> and instead blocking them only on the per device limit sounds like a good idea.
> 
> I can also do something similar, and that will take away the need for
> exporting a per group congestion interface to higher layers and reduce
> complexity. If some kernel thread does not want to block, it will
> continue to use the existing per device/bdi congestion interface.
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
@ 2009-10-02  2:57                   ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-02  2:57 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew,
	yoshikawa.takuya, jmoyer, nauman, mingo, m-ikeda, riel, lizf,
	fchecconi, s-uchida, containers, linux-kernel, akpm,
	righi.andrea, torvalds

On Thu, Oct 01, 2009 at 09:31:09AM -0400, Vivek Goyal wrote:
> On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> > 
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > > > Hi Vivek,
> > > > 
> > > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > > I was thinking that elevator layer will do the merge of bios. So IO
> > > > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > > > into the disk and again timestamp with finish time once request finishes.
> > > > > 
> > > > > This way higher layer can get an idea how much disk time a group of bios
> > > > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > > > then time accounting becomes an issue.
> > > > > 
> > > > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > > > requests finish at time t4, t5, t6 and t7. For the sake of simplicity,
> > > > > assume the time elapsed between milestones is t. Also assume that all these
> > > > > requests are from same queue/group.
> > > > > 
> > > > >         t0   t1   t2   t3  t4   t5   t6   t7
> > > > >         rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
> > > > > 
> > > > > Now higher layer will think that time consumed by group is:
> > > > > 
> > > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > > > 
> > > > > But the time elapsed is only 7t.
> > > > 
> > > > IO controller can know how many requests are issued and still in
> > > > progress. Is it not enough to accumulate the time while in-flight IOs
> > > > exist?
> > > > 
> > > 
> > > That time would not reflect the disk time used. It will be the following:
> > > 
> > > (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> > > (time spent in disk)
> > 
> > In the case where multiple IO requests are issued from IO controller,
> > that time measurement is the time from when the first IO request is
> > issued until when endio is called for the last IO request. Does
> > that not reflect disk time?
> > 
> 
> Not accurately, as it will include the time spent in CFQ queues as
> well as in the dispatch queue. I will not worry much about dispatch queue
> time, but time spent in CFQ queues can be significant.
> 
> This is assuming that you are using token based scheme and will be
> dispatching requests from multiple groups at the same time.
> 

Thinking more about it...

Does time based fairness make sense at higher level logical devices?

- Time based fairness generally helps with rotational devices which have
  high seek costs. At a higher level we don't even know the nature of the
  underlying device where the IO will ultimately go.

- For time based fairness to work accurately at a higher level, it will
  most likely require dispatching from a single group at a time, waiting
  for that group's requests to complete, and only then dispatching from
  the next group. Something like CFQ's model of queue service.

  Dispatching from a single queue/group works well in the case of a single
  underlying device where CFQ is operating, but higher level devices
  typically have multiple physical devices under them, so it makes things
  more linear and reduces parallel processing. Dispatching from a single
  group at a time and waiting before we dispatch from the next group will
  most likely be a throughput killer on higher level devices.

  If we don't adopt the policy of dispatching from a single group, then we
  run into all the issues of weak isolation between groups, higher
  latencies, preemptions across groups, etc.

The more I think about the whole issue and the desired set of requirements,
the more I am convinced that we probably need two IO controlling
mechanisms: one which focuses purely on providing bandwidth fairness
numbers on high level devices, and another which works at low level devices
with CFQ and provides good bandwidth shaping, strong isolation, fairness
within a group and good control of latencies.

The higher level controller will not worry about time based policies. It
can implement max-bw and proportional-bw control based on the size and
number of IOs.
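A sketch of what such a size/count-based controller could look like (the
names and the token scheme here are purely illustrative, not from any
posted patch): each group is granted tokens in proportion to its weight,
an IO is charged by its size, and a group that runs out of tokens is
throttled until the next replenish period.

```c
struct io_group {
	unsigned int weight;
	long tokens;
};

/* Refill: split period_tokens among the groups by weight proportion. */
static void replenish(struct io_group *grp, int n, long period_tokens)
{
	unsigned int wsum = 0;
	int i;

	for (i = 0; i < n; i++)
		wsum += grp[i].weight;
	for (i = 0; i < n; i++)
		grp[i].tokens = period_tokens * grp[i].weight / wsum;
}

/* Charge an IO of 'sectors' sectors against its group.  Returns 1 if the
 * IO may be dispatched now, 0 if the submitter must be throttled until
 * the next replenish. */
static int charge_io(struct io_group *g, long sectors)
{
	if (g->tokens < sectors)
		return 0;
	g->tokens -= sectors;
	return 1;
}
```

Capping the number of tokens handed out per fixed period gives the max-bw
part; the weighted split gives the proportional-bw part.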

The lower level controller at the CFQ level will implement time based group
scheduling. Keeping it at the low level has the advantage of better
utilization of the hardware in various dm/md configurations (as no
throttling takes place at a higher level), but at the cost of less strict
fairness numbers at higher level devices. So those who want strict fairness
policies at higher level devices, irrespective of the shortcomings, can use
the former; others can stick to the lower level controller.

For buffered write control we anyway have to either do something in the
memory controller or come up with another cgroup controller which throttles
IO before it goes into the cache. Or, in fact, we can have another look at
Andrea Righi's controller, which provided max BW and throttled buffered
writes before they got into the page cache, and try to provide proportional
BW there as well.

Basically I see the space for two IO controllers. At the moment I can't
think of a way to come up with a single controller which satisfies all
the requirements. So instead provide two and let the user choose one based
on need.

Any thoughts?

Before finishing this mail, I will throw a wacky idea into the ring. I was
going through the request based dm-multipath paper. Would it make sense
to implement a request based dm-ioband? Basically we implement all the
group scheduling in CFQ and let dm-ioband implement a request function
to take the request and break it back into bios. This way we keep
all the group control in one place and also meet most of the requirements.

So request based dm-ioband will have a request in hand once that request
has passed group control and prio control. Because dm-ioband is a device
mapper target, one can put it on higher level devices (practically taking
CFQ to the higher level device) and provide fairness there. One can also
put it on those SSDs which don't use an IO scheduler (this kind of forces
them to use one).

I am sure there will be many issues, but one big issue I can think of is
that CFQ assumes there is one device beneath it and dispatches requests
from one queue (in case of idling); that would kill parallelism at the
higher layer, and throughput will suffer on many dm/md configurations.

Thanks
Vivek

> But if you figure out a way that you dispatch requests from one group only
> at one time and wait for all requests to finish and then let next group
> go, then above can work fairly accurately. In that case it will become
> like CFQ with the only difference that effectively we have one queue per
> group instead of per process.
> 
> > > > > Secondly if a different group is running only single sequential reader,
> > > > > there CFQ will be driving queue depth of 1 and time will not be running
> > > > > faster and this inaccuracy in accounting will lead to unfair share between
> > > > > groups.
> > > > >
> > > > > So we need something better to get a sense which group used how much of
> > > > > disk time.
> > > > 
> > > > It could be solved by implementing a way to pass such information
> > > > from the IO scheduler to the higher layer controller.
> > > > 
> > > 
> > > How would you do that? Can you give some details exactly how and what
> > > information IO scheduler will pass to higher level IO controller so that IO
> > > controller can attribute right time to the group.
> > 
> > If you would like to know when the idle timer has expired, how about
> > adding a function to the IO controller so that it can be notified by the
> > IO scheduler? The IO scheduler calls the function when the timer expires.
> > 
> 
> This probably can be done. So this is like syncing between lower layers
> and higher layers about when do we start idling and when do we stop it and
> both the layers should be in sync.
> 
> This is something my common layer approach does. Because it is so close to
> the IO scheduler, I can do it relatively easily.
> 
> One probably can create interfaces to even propagate this information up.
> But this all will probably come into the picture only if we don't use
> token based schemes and come up with something where, at any point in
> time, dispatches are from one group only.
> 
> > > > > > How about making throttling policy be user selectable like the IO
> > > > > > scheduler and putting it in the higher layer? So we could support
> > > > > > all of policies (time-based, size-based and rate limiting). There
> > > > > > seems not to be only one solution which satisfies all users. But I agree
> > > > > > with starting with proportional bandwidth control first. 
> > > > > > 
> > > > > 
> > > > > What are the cases where time based policy does not work and size based
> > > > > policy works better and user would choose size based policy and not timed
> > > > > based one?
> > > > 
> > > > I think that disk time is not simply proportional to IO size. If there
> > > > are two groups whose weights are equally assigned and they issue
> > > > different sized IOs respectively, the bandwidth of each group would
> > > > not be distributed equally as expected.
> > > > 
> > > 
> > > If we are providing fairness in terms of time, it is fair. If we provide
> > > equal time slots to two processes and if one got more IO done because it
> > > was not wasting time seeking or it issued bigger size IO, it deserves that
> > > higher BW. IO controller will make sure that process gets fair share in
> > > terms of time and exactly how much BW one got will depend on the workload.
> > > 
> > > That's the precise reason that fairness in terms of time is better on
> > > seeky media.
> > 
> > If the seek time is negligible, the bandwidth would not be distributed 
> > according to a proportion of weight settings. I think that it would be
> > unclear for users to understand how bandwidth is distributed. And I
> > also think that seeky media would gradually become obsolete,
> > 
> 
> I can understand that with a lower seek cost the game starts changing, and
> probably a size based policy also works decently.
> 
> In that case, at some point CFQ will also need to support another
> mode/policy where fairness is provided in terms of the size of IO, if
> it detects an SSD with hardware queuing. Currently it seems to disable
> idling in that case, but this is not very good from a fairness point of
> view. I guess if CFQ wants to provide fairness in such cases, it needs to
> dynamically change its approach and start thinking in terms of size of IO.
> 
> So far my testing has been very limited to hard disks connected to my
> computer. I will do some testing on high end enterprise storage and see
> how much seeks matter and how well both implementations work.
> 
> > > > > I am not against implementing things in higher layer as long as we can
> > > > > ensure tight control on latencies, strong isolation between groups and
> > > > > not break CFQ's class and ioprio model with-in group.
> > > > > 
> > > > > > BTW, I will start to reimplement dm-ioband into block layer.
> > > > > 
> > > > > Can you elaborate little bit on this?
> > > > 
> > > > bio is grabbed in generic_make_request() and throttled by the same
> > > > mechanism as dm-ioband's. The dmsetup command is no longer necessary.
> > > > 
> > > 
> > > Ok, so one would not need a dm-ioband device now, but the same dm-ioband
> > > throttling policies will apply. So until and unless we figure out a
> > > better way, the issues I have pointed out will still exist even in the
> > > new implementation.
> > 
> > Yes, those still exist, but somehow I would like to try to solve them.
> > 
> > > > The default value of io_limit on the previous test was 128 (not 192),
> > > > which is equal to the default value of nr_requests.
> > > 
> > > Hm..., I used following commands to create two ioband devices.
> > > 
> > > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> > > "weight 0 :100" | dmsetup create ioband1
> > > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> > > "weight 0 :100" | dmsetup create ioband2
> > > 
> > > Here io_limit value is zero so it should pick default value. Following is
> > > output of "dmsetup table" command.
> > > 
> > > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> > > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
> > >                                     ^^^^
> > > IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> > > to be 192?
> > 
> > The default value has changed since v1.12.0, increasing from 128 to 192.
> > 
> > > > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > > > writes.
> > > > 
> > > > O.K. You removed "conv=fdatasync"; the new dm-ioband handles
> > > > sync/async requests separately, and it solves this
> > > > buffered-write-starves-read problem. I would like to post it soon
> > > > after doing some more tests.
> > > > 
> > > > > On top of that can you please give some details how increasing the
> > > > > buffered queue length reduces the impact of writers?
> > > > 
> > > > When the number of in-flight IOs exceeds io_limit, processes which are
> > > > going to issue IOs are made to sleep by dm-ioband until all the
> > > > in-flight IOs are finished. But the IO scheduler layer can accept more
> > > > IO requests than the value of io_limit, so it was a throughput bottleneck.
> > > > 
> > > 
> > > Ok, so it should have been a throughput bottleneck, but how did it solve
> > > the issue of the writer starving the reader, as you had mentioned in the mail?
> > 
> > As I wrote above, I modified dm-ioband to handle sync/async requests
> > separately, so even if writers do a lot of buffered IOs, readers can
> > issue IOs regardless of the writers' busyness. Once the IOs are
> > backlogged for throttling, both sync and async requests are issued
> > according to their order of arrival.
> > 
> 
> Ok, so if both the readers and writers are buffered and some tokens become
> available, then these tokens will be divided half and half between the
> reader and writer queues?
> 
> > > Secondly, you mentioned that processes are made to sleep once we cross
> > > io_limit. This sounds like the request descriptor facility on the
> > > request queue, where processes are made to sleep.
> > >
> > > There are threads in the kernel which don't want to sleep while
> > > submitting bios. For example, btrfs has a bio-submitting thread which
> > > does not want to sleep, hence it checks whether the device is congested
> > > and does not submit the bio if it is. How would you handle such cases?
> > > Have you implemented any per-group congestion interface to make sure
> > > such IOs don't sleep if the group is congested?
> > >
> > > Or is this limit per ioband device, shared by every group on the
> > > device? If yes, then how would you provide isolation between groups?
> > > If one group consumes the io_limit tokens, the others will simply
> > > be serialized on that device.
> > 
> > There are two kinds of limits, and both bound the number of IO requests
> > which can be issued simultaneously: one is per ioband device,
> > the other is per ioband group. The per-group limit assigned to
> > each group is calculated by dividing io_limit according to the
> > group's proportion of the weight.
> > 
> > The kernel thread is not made to sleep by the per group limit, because
> > several kinds of kernel threads submit IOs from multiple groups and
> > for multiple devices in a single thread. At this time, the kernel
> > thread is made to sleep by the per device limit only.
> > 
> 
> Interesting. Actually, not blocking kernel threads on the per-group limit
> and instead blocking them only on the per-device limit sounds like a good idea.
> 
> I can also do something similar; that will take away the need to export a
> per-group congestion interface to higher layers and reduce complexity. If
> some kernel thread does not want to block, it will continue to use the
> existing per-device/bdi congestion interface.
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-01  7:33                 ` Mike Galbraith
@ 2009-10-01 18:58                       ` Jens Axboe
  2009-10-02 18:08                   ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-01 18:58 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Oct 01 2009, Mike Galbraith wrote:
> > CIC_SEEK_THR is 8K jiffies, so that would be 8 seconds on a 1000 HZ system. Try
> > using one "slice_idle" period of 8 ms. But it might turn out to be too
> > short depending on the disk speed.
> 
> Yeah, it is too short, as is even _400_ ms.  Trouble is, by the time
> some new task is determined to be seeky, the damage is already done.
> 
> The below does better, though not as well as "just say no to overload"
> of course ;-)

So this essentially takes the "avoid impact from previous slice" to a
new extreme, by idling even before dispatching requests from the new
queue. We basically do two things to prevent this already - one is to
only set the slice when the first request is actually serviced, and the
other is to drain async requests completely before starting sync ones.
I'm a bit surprised that the former doesn't solve the problem fully, I
guess what happens is that if the drive has been flooded with writes, it
may service the new read immediately and then return to finish emptying
its writeback cache. This will cause an impact for any sync IO until
that cache is flushed, and then cause that sync queue to not get as much
service as it should have.
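For reference, the "set slice on first complete" logic mentioned here can
be sketched in isolation (a generic illustration, not the actual
cfq-iosched code): the slice clock is armed only when the new queue's first
request completes, so latency caused by the previous queue's leftover
writeback is not charged to the new queue.

```c
struct slice {
	unsigned long start;	/* 0 = clock not started yet */
	unsigned long len;
};

/* Called on every request completion for the active queue; arms the
 * slice clock on the first completion only. */
static void slice_on_complete(struct slice *s, unsigned long now)
{
	if (!s->start)
		s->start = now;
}

/* Has the queue consumed its slice?  Never expires before the first
 * completion has started the clock. */
static int slice_expired(const struct slice *s, unsigned long now)
{
	return s->start && now - s->start >= s->len;
}
```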

Perhaps the "set slice on first complete" isn't working correctly? Or
perhaps we just need to be more extreme.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]               ` <20091001.154125.104044685.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-10-01 13:31                 ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-01 13:31 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > > Hi Vivek,
> > > 
> > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > I was thinking that elevator layer will do the merge of bios. So IO
> > > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > > into the disk and again timestamp with finish time once request finishes.
> > > > 
> > > > This way higher layer can get an idea how much disk time a group of bios
> > > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > > then time accounting becomes an issue.
> > > > 
> > > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > > requests finish at time t4, t5, t6 and t7. For the sake of simplicity,
> > > > assume the time elapsed between milestones is t. Also assume that all these
> > > > requests are from same queue/group.
> > > > 
> > > >         t0   t1   t2   t3  t4   t5   t6   t7
> > > >         rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
> > > > 
> > > > Now higher layer will think that time consumed by group is:
> > > > 
> > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > > 
> > > > But the time elapsed is only 7t.
> > > 
> > > IO controller can know how many requests are issued and still in
> > > progress. Is it not enough to accumulate the time while in-flight IOs
> > > exist?
> > > 
> > 
> > That time would not reflect the disk time used. It will be the following:
> > 
> > (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> > (time spent in disk)
> 
> In the case where multiple IO requests are issued from IO controller,
> that time measurement is the time from when the first IO request is
> issued until when endio is called for the last IO request. Does
> that not reflect disk time?
> 

Not accurately, as it will include the time spent in CFQ queues as
well as in the dispatch queue. I will not worry much about dispatch queue
time, but time spent in CFQ queues can be significant.

This is assuming that you are using token based scheme and will be
dispatching requests from multiple groups at the same time.

But if you figure out a way to dispatch requests from only one group
at a time and wait for all its requests to finish before letting the next
group go, then the above can work fairly accurately. In that case it
becomes like CFQ, with the only difference that effectively we have one
queue per group instead of per process.

> > > > Secondly if a different group is running only single sequential reader,
> > > > there CFQ will be driving queue depth of 1 and time will not be running
> > > > faster and this inaccuracy in accounting will lead to unfair share between
> > > > groups.
> > > >
> > > > So we need something better to get a sense which group used how much of
> > > > disk time.
> > > 
> > > It could be solved by implementing a way to pass such information
> > > from the IO scheduler to the higher layer controller.
> > > 
> > 
> > How would you do that? Can you give some details exactly how and what
> > information IO scheduler will pass to higher level IO controller so that IO
> > controller can attribute right time to the group.
> 
> If you would like to know when the idle timer has expired, how about
> adding a function to the IO controller so that it can be notified by the
> IO scheduler? The IO scheduler calls the function when the timer expires.
> 

This probably can be done. So this is like syncing between lower layers
and higher layers about when do we start idling and when do we stop it and
both the layers should be in sync.

This is something my common layer approach does. Because it is so close to
the IO scheduler, I can do it relatively easily.

One probably can create interfaces to even propagate this information up.
But this all will probably come into the picture only if we don't use
token based schemes and come up with something where, at any point in
time, dispatches are from one group only.
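Such an idle-timer notification could be shaped as a small ops table (a
hypothetical interface sketch; the real hooks would live in the
elevator/CFQ code): the scheduler calls up when idling starts and when the
idle timer expires, so both layers account the same intervals.

```c
/* Hypothetical upcall interface from the IO scheduler to a higher-level
 * IO controller. */
struct ioc_notify_ops {
	void (*idle_start)(void *ctrl, unsigned long now);
	void (*idle_expired)(void *ctrl, unsigned long now);
};

/* Toy controller that just accumulates the idle time it is told about. */
struct toy_ctrl {
	unsigned long idle_begin;
	unsigned long idle_total;
};

static void toy_idle_start(void *p, unsigned long now)
{
	((struct toy_ctrl *)p)->idle_begin = now;
}

static void toy_idle_expired(void *p, unsigned long now)
{
	struct toy_ctrl *c = p;

	c->idle_total += now - c->idle_begin;
}

static const struct ioc_notify_ops toy_ops = {
	.idle_start   = toy_idle_start,
	.idle_expired = toy_idle_expired,
};
```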

> > > > > How about making throttling policy be user selectable like the IO
> > > > > scheduler and putting it in the higher layer? So we could support
> > > > > all of policies (time-based, size-based and rate limiting). There
> > > > > seems not to be only one solution which satisfies all users. But I agree
> > > > > with starting with proportional bandwidth control first. 
> > > > > 
> > > > 
> > > > What are the cases where time based policy does not work and size based
> > > > policy works better and user would choose size based policy and not timed
> > > > based one?
> > > 
> > > I think that disk time is not simply proportional to IO size. If there
> > > are two groups whose weights are equally assigned and they issue
> > > different sized IOs respectively, the bandwidth of each group would
> > > not be distributed equally as expected.
> > > 
> > 
> > If we are providing fairness in terms of time, it is fair. If we provide
> > equal time slots to two processes and if one got more IO done because it
> > was not wasting time seeking or it issued bigger size IO, it deserves that
> > higher BW. IO controller will make sure that process gets fair share in
> > terms of time and exactly how much BW one got will depend on the workload.
> > 
> > That's the precise reason that fairness in terms of time is better on
> > seeky media.
> 
> If the seek time is negligible, the bandwidth would not be distributed 
> according to a proportion of weight settings. I think that it would be
> unclear for users to understand how bandwidth is distributed. And I
> also think that seeky media would gradually become obsolete,
> 

I can understand that with a lower seek cost the game starts changing, and
probably a size based policy also works decently.

In that case, at some point CFQ will also need to support another
mode/policy where fairness is provided in terms of the size of IO, if
it detects an SSD with hardware queuing. Currently it seems to disable
idling in that case, but this is not very good from a fairness point of
view. I guess if CFQ wants to provide fairness in such cases, it needs to
dynamically change its approach and start thinking in terms of size of IO.
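That mode switch might look roughly like this (hypothetical policy helper,
not existing CFQ code): keep time-slice fairness on rotational media, and
fall back to size-based accounting when an SSD with hardware queuing is
detected and idling is disabled.

```c
enum fairness_mode {
	FAIRNESS_TIME,	/* equal disk-time slices, idling enabled */
	FAIRNESS_SIZE,	/* equal bytes dispatched, no idling */
};

static enum fairness_mode pick_fairness_mode(int rotational, int hw_queuing)
{
	/* On an NCQ SSD, seek cost is negligible and idling is disabled,
	 * so time slices no longer measure anything useful; account by
	 * the size of IO instead. */
	if (!rotational && hw_queuing)
		return FAIRNESS_SIZE;
	return FAIRNESS_TIME;
}
```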

So far my testing has been very limited to hard disks connected to my
computer. I will do some testing on high end enterprise storage and see
how much seeks matter and how well both implementations work.

> > > > I am not against implementing things in higher layer as long as we can
> > > > ensure tight control on latencies, strong isolation between groups and
> > > > not break CFQ's class and ioprio model with-in group.
> > > > 
> > > > > BTW, I will start to reimplement dm-ioband into block layer.
> > > > 
> > > > Can you elaborate little bit on this?
> > > 
> > bio is grabbed in generic_make_request() and throttled by the same
> > mechanism as dm-ioband's. The dmsetup command is no longer necessary.
> > > 
> > 
> Ok, so one would not need a dm-ioband device now, but the same dm-ioband
> throttling policies will apply. So until and unless we figure out a
> better way, the issues I have pointed out will still exist even in the
> new implementation.
> 
> Yes, those still exist, but somehow I would like to try to solve them.
> 
> > > The default value of io_limit on the previous test was 128 (not 192),
> > > which is equal to the default value of nr_requests.
> > 
> > Hm..., I used following commands to create two ioband devices.
> > 
> > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> > "weight 0 :100" | dmsetup create ioband1
> > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> > "weight 0 :100" | dmsetup create ioband2
> > 
> > Here io_limit value is zero so it should pick default value. Following is
> > output of "dmsetup table" command.
> > 
> > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
> >                                     ^^^^
> > IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> > to be 192?
> 
> The default value has changed since v1.12.0, increasing from 128 to 192.
> 
> > > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > > writes.
> > > 
> > > O.K. You removed "conv=fdatasync", the new dm-ioband handles
> > > sync/async requests separately, and it solves this
> > > buffered-write-starves-read problem. I would like to post it soon
> > > after doing some more tests.
> > > 
> > > > On top of that can you please give some details how increasing the
> > > > buffered queue length reduces the impact of writers?
> > > 
> > > When the number of in-flight IOs exceeds io_limit, processes which are
> > > going to issue IOs are made to sleep by dm-ioband until all the in-flight
> > > IOs are finished. But IO scheduler layer can accept IO requests more
> > > than the value of io_limit, so it was a bottleneck of the throughput.
> > > 
> > 
> > Ok, so it should have been a throughput bottleneck, but how did it solve the
> > issue of the writer starving the reader, as you had mentioned in the mail?
> 
> As wrote above, I modified dm-ioband to handle sync/async requests
> separately, so even if writers do a lot of buffered IOs, readers can
> issue IOs regardless of the writers' busyness. Once the IOs are backlogged
> for throttling, both sync and async requests are issued according
> to the order of arrival.
> 

Ok, so if both the readers and writers are buffered and some tokens become
available, will these tokens be divided half and half between the reader
and writer queues?
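The question above can be made concrete with a toy model. This is purely a
hypothetical sketch of how replenished tokens might be split between sync
(reader) and async (writer) queues; the actual dm-ioband policy is not
specified in this thread, and the 50/50 default here is an assumption:

```python
def split_tokens(tokens, sync_backlogged, async_backlogged, sync_share=0.5):
    # Hypothetical split: if both the sync (reader) and async (writer)
    # queues are backlogged, divide the replenished tokens by sync_share;
    # otherwise hand everything to whichever side is waiting.
    if sync_backlogged and async_backlogged:
        sync = int(tokens * sync_share)
        return sync, tokens - sync
    if sync_backlogged:
        return tokens, 0
    return 0, tokens

print(split_tokens(100, True, True))   # (50, 50)
print(split_tokens(100, True, False))  # (100, 0)
```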

> > Secondly, you mentioned that processes are made to sleep once we cross 
> > io_limit. This sounds like the request descriptor facility on the request queue
> > where processes are made to sleep.
> >
> > There are threads in the kernel which don't want to sleep while submitting
> > bios. For example, btrfs has a bio-submitting thread which does not want
> > to sleep; hence it checks with the device whether it is congested and does
> > not submit the bio if it is. How would you handle such cases? Have
> > you implemented any per-group congestion interface to make sure
> > such IOs don't sleep if the group is congested?
> >
> > Or is this limit per ioband device, which every group on the device is
> > sharing? If yes, then how would you provide isolation between groups,
> > because if one group consumes the io_limit tokens, then others will simply
> > be serialized on that device?
> 
> There are two kinds of limits, and both limit the number of IO requests
> which can be issued simultaneously, but one is per ioband device and
> the other is per ioband group. The per-group limit assigned to
> each group is calculated by dividing io_limit according to the groups'
> proportion of weight.
> 
> The kernel thread is not made to sleep by the per group limit, because
> several kinds of kernel threads submit IOs from multiple groups and
> for multiple devices in a single thread. At this time, the kernel
> thread is made to sleep by the per device limit only.
> 

Interesting. Not blocking kernel threads on the per-group limit and
instead blocking them only on the per-device limit sounds like a good idea.

I can do something similar; that will take away the need to export a
per-group congestion interface to higher layers and reduce complexity.
Kernel threads that do not want to block will continue to use the existing
per-device/bdi congestion interface.
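Ryo's two-level scheme can be sketched in a few lines of user-space Python
(all names are hypothetical; the real accounting lives inside dm-ioband).
Per-group limits are derived from io_limit in proportion to weight, and only
the per-device limit can put a kernel thread to sleep:

```python
def group_limits(io_limit, weights):
    # Divide the per-device io_limit among groups in proportion to weight.
    total = sum(weights.values())
    return {g: io_limit * w // total for g, w in weights.items()}

def must_sleep(in_flight_dev, in_flight_grp, io_limit, grp_limit, kernel_thread):
    # Kernel threads are throttled only by the per-device limit; ordinary
    # processes are throttled by whichever limit they hit first.
    if in_flight_dev >= io_limit:
        return True
    if kernel_thread:
        return False
    return in_flight_grp >= grp_limit

# io_limit 192 (the dm-ioband default discussed above), two weighted groups.
limits = group_limits(192, {"g1": 768, "g2": 256})
print(limits)  # {'g1': 144, 'g2': 48}
print(must_sleep(100, 144, 192, limits["g1"], kernel_thread=True))   # False
print(must_sleep(100, 144, 192, limits["g1"], kernel_thread=False))  # True
```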

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-10-01  6:41               ` Ryo Tsuruta
@ 2009-10-01 13:31                 ` Vivek Goyal
  -1 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-01 13:31 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah,
	lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > > Hi Vivek,
> > > 
> > > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > I was thinking that elevator layer will do the merge of bios. So IO
> > > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > > into the disk and again timestamp with finish time once request finishes.
> > > > 
> > > > This way higher layer can get an idea how much disk time a group of bios
> > > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > > then time accounting becomes an issue.
> > > > 
> > > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > > requests finish at time t4, t5, t6 and t7. For sake of simplicity assume
> > > > time elapsed between each of milestones is t. Also assume that all these
> > > > requests are from same queue/group.
> > > > 
> > > >         t0   t1   t2   t3  t4   t5   t6   t7
> > > >         rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
> > > > 
> > > > Now higher layer will think that time consumed by group is:
> > > > 
> > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > > 
> > > > But the time elapsed is only 7t.
> > > 
> > > IO controller can know how many requests are issued and still in
> > > progress. Is it not enough to accumulate the time while in-flight IOs
> > > exist?
> > > 
> > 
> > That time would not reflect the disk time used. It will be the following:
> > 
> > (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> > (time spent in disk)
> 
> In the case where multiple IO requests are issued from IO controller,
> that time measurement is the time from when the first IO request is
> issued until when endio is called for the last IO request. Does it
> not reflect disk time?
> 

Not accurately, as it will include the time spent in CFQ queues as
well as in the dispatch queue. I will not worry much about dispatch queue
time, but the time spent in CFQ queues can be significant.

This is assuming that you are using token based scheme and will be
dispatching requests from multiple groups at the same time.

But if you figure out a way to dispatch requests from only one group
at a time and wait for all its requests to finish before letting the next
group go, then the above can work fairly accurately. In that case it
becomes like CFQ, with the only difference that effectively we have one
queue per group instead of per process.
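The 16t-vs-7t discrepancy quoted above is easy to reproduce with a small
user-space sketch (Python for brevity; names are illustrative, not kernel
code). Summing each request's dispatch-to-completion time over-counts
whenever requests overlap in flight, while merging the busy intervals
recovers the actual elapsed disk time:

```python
def naive_disk_time(reqs):
    # Sum each request's (finish - dispatch) individually: over-counts overlap.
    return sum(f - d for d, f in reqs)

def busy_time(reqs):
    # Merge overlapping [dispatch, finish) intervals and sum their union.
    total, cur_start, cur_end = 0, None, None
    for d, f in sorted(reqs):
        if cur_end is None or d > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = d, f
        else:
            cur_end = max(cur_end, f)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

# rq1..rq4 dispatched at t0..t3 and completed at t4..t7 (unit: t).
reqs = [(0, 4), (1, 5), (2, 6), (3, 7)]
print(naive_disk_time(reqs))  # 16 -- what the higher layer would charge
print(busy_time(reqs))        # 7  -- the time that actually elapsed
```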

> > > > Secondly if a different group is running only single sequential reader,
> > > > there CFQ will be driving queue depth of 1 and time will not be running
> > > > faster and this inaccuracy in accounting will lead to unfair share between
> > > > groups.
> > > >
> > > > So we need something better to get a sense which group used how much of
> > > > disk time.
> > > 
> > > It could be solved by implementing a way to pass such information
> > > from the IO scheduler to the higher-layer controller.
> > > 
> > 
> > How would you do that? Can you give some details exactly how and what
> > information IO scheduler will pass to higher level IO controller so that IO
> > controller can attribute right time to the group.
> 
> If you would like to know when the idle timer expires, how about
> adding a function to the IO controller to be notified of it by the IO
> scheduler? The IO scheduler calls the function when the timer expires.
> 

This probably can be done. So this is like syncing between the lower and
higher layers about when we start idling and when we stop, and both
layers should be in sync.

This is something my common layer approach does. Because it is so close
to the IO scheduler, I can do it relatively easily.

One probably can create interfaces to propagate this information up.
But this all will probably come into the picture only if we don't use
token-based schemes and come up with something where, at any point of
time, dispatches are from one group only.
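Such an idle-timer notification could look roughly like the following
user-space sketch (entirely hypothetical interface names; in the kernel this
would be a callback hooked into the elevator, not a Python class):

```python
class IdleNotifier:
    # Hypothetical interface: the IO scheduler invokes each registered
    # callback when its idle timer fires, so a higher-layer controller can
    # stop charging the group for time it spent idling.
    def __init__(self):
        self.callbacks = []

    def register(self, fn):
        self.callbacks.append(fn)

    def idle_timer_expired(self, group):
        for fn in self.callbacks:
            fn(group)

events = []
notifier = IdleNotifier()
notifier.register(lambda g: events.append(("idle_expired", g)))
notifier.idle_timer_expired("grp1")
print(events)  # [('idle_expired', 'grp1')]
```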

> > > > > How about making throttling policy be user selectable like the IO
> > > > > scheduler and putting it in the higher layer? So we could support
> > > > > all of policies (time-based, size-based and rate limiting). There
> > > > > seems not to only one solution which satisfies all users. But I agree
> > > > > with starting with proportional bandwidth control first. 
> > > > > 
> > > > 
> > > > What are the cases where time based policy does not work and size based
> > > > policy works better and user would choose size based policy and not timed
> > > > based one?
> > > 
> > > I think that disk time is not simply proportional to IO size. If there
> > > are two groups whose weights are equally assigned and they issue
> > > different sized IOs respectively, the bandwidth of each group would
> > > not be distributed equally as expected. 
> > > 
> > 
> > If we are providing fairness in terms of time, it is fair. If we provide
> > equal time slots to two processes and if one got more IO done because it
> > was not wasting time seeking or it issued bigger size IO, it deserves that
> > higher BW. IO controller will make sure that process gets fair share in
> > terms of time and exactly how much BW one got will depend on the workload.
> > 
> > That's the precise reason that fairness in terms of time is better on
> > seeky media.
> 
> If the seek time is negligible, the bandwidth would not be distributed 
> according to a proportion of weight settings. I think that it would be
> unclear for users to understand how bandwidth is distributed. And I
> also think that seeky media would gradually become obsolete,
> 

I can understand that as the seek cost gets smaller the game starts
changing, and probably a size-based policy also works decently.

In that case, at some point CFQ will also need to support another
mode/policy where fairness is provided in terms of size of IO if it
detects an SSD with hardware queuing. Currently it seems to disable
idling in that case, but this is not very good from a fairness point of
view. I guess if CFQ wants to provide fairness in such cases, it needs to
dynamically change shape and start thinking in terms of size of IO.

So far my testing has been limited to hard disks connected to my
computer. I will do some testing on high-end enterprise storage to see
how much seeks matter and how well both implementations work.
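A quick back-of-the-envelope sketch of the trade-off (illustrative numbers,
not measurements): under time-based fairness two equal-weight groups get
equal disk time, so achieved bandwidth follows each group's own throughput;
a size-based policy would equalize the byte counts instead, at the cost of
giving a seeky group far more disk time.

```python
def time_based_bw(slice_ms, throughput_mb_s):
    # Each group gets an equal time slice; MB moved per slice follows the
    # group's own throughput, not its weight alone.
    return {g: slice_ms / 1000.0 * t for g, t in throughput_mb_s.items()}

# Hypothetical rotational-disk numbers: a sequential reader streams at
# 100 MB/s, a seeky reader manages 5 MB/s.
per_slice = time_based_bw(100, {"seq": 100.0, "seeky": 5.0})
print(per_slice)  # {'seq': 10.0, 'seeky': 0.5}  (MB per 100 ms slice)

# Size-based fairness would equalize the MB counts, so the seeky group
# would need ~20x the disk time of the sequential one.
ratio = 100.0 / 5.0
print(ratio)  # 20.0
```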

> > > > I am not against implementing things in higher layer as long as we can
> > > > ensure tight control on latencies, strong isolation between groups and
> > > > not break CFQ's class and ioprio model with-in group.
> > > > 
> > > > > BTW, I will start to reimplement dm-ioband into block layer.
> > > > 
> > > > Can you elaborate a little bit on this?
> > > 
> > > bio is grabbed in generic_make_request() and throttled the same way
> > > as dm-ioband's mechanism does. The dmsetup command is not necessary any longer.
> > > 
> > 
> > Ok, so one would not need a dm-ioband device now, but the same dm-ioband
> > throttling policies will apply. So until and unless we figure out a
> > better way, the issues I have pointed out will still exist even in the
> > new implementation.
> 
> Yes, those still exist, but somehow I would like to try to solve them.
> 
> > > The default value of io_limit on the previous test was 128 (not 192)
> > > which is equal to the default value of nr_request.
> > 
> > Hm..., I used following commands to create two ioband devices.
> > 
> > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> > "weight 0 :100" | dmsetup create ioband1
> > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> > "weight 0 :100" | dmsetup create ioband2
> > 
> > Here io_limit value is zero so it should pick default value. Following is
> > output of "dmsetup table" command.
> > 
> > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
> >                                     ^^^^
> > IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> > to be 192?
> 
> The default value has changed since v1.12.0 and increased from 128 to 192.
> 
> > > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > > writes.
> > > 
> > > O.K. You removed "conv=fdatasync", the new dm-ioband handles
> > > sync/async requests separately, and it solves this
> > > buffered-write-starves-read problem. I would like to post it soon
> > > after doing some more tests.
> > > 
> > > > On top of that can you please give some details how increasing the
> > > > buffered queue length reduces the impact of writers?
> > > 
> > > When the number of in-flight IOs exceeds io_limit, processes which are
> > > going to issue IOs are made to sleep by dm-ioband until all the in-flight
> > > IOs are finished. But IO scheduler layer can accept IO requests more
> > > than the value of io_limit, so it was a bottleneck of the throughput.
> > > 
> > 
> > Ok, so it should have been a throughput bottleneck, but how did it solve the
> > issue of the writer starving the reader, as you had mentioned in the mail?
> 
> As wrote above, I modified dm-ioband to handle sync/async requests
> separately, so even if writers do a lot of buffered IOs, readers can
> issue IOs regardless of the writers' busyness. Once the IOs are backlogged
> for throttling, both sync and async requests are issued according
> to the order of arrival.
> 

Ok, so if both the readers and writers are buffered and some tokens become
available, will these tokens be divided half and half between the reader
and writer queues?

> > Secondly, you mentioned that processes are made to sleep once we cross 
> > io_limit. This sounds like the request descriptor facility on the request queue
> > where processes are made to sleep.
> >
> > There are threads in the kernel which don't want to sleep while submitting
> > bios. For example, btrfs has a bio-submitting thread which does not want
> > to sleep; hence it checks with the device whether it is congested and does
> > not submit the bio if it is. How would you handle such cases? Have
> > you implemented any per-group congestion interface to make sure
> > such IOs don't sleep if the group is congested?
> >
> > Or is this limit per ioband device, which every group on the device is
> > sharing? If yes, then how would you provide isolation between groups,
> > because if one group consumes the io_limit tokens, then others will simply
> > be serialized on that device?
> 
> There are two kinds of limits, and both limit the number of IO requests
> which can be issued simultaneously, but one is per ioband device and
> the other is per ioband group. The per-group limit assigned to
> each group is calculated by dividing io_limit according to the groups'
> proportion of weight.
> 
> The kernel thread is not made to sleep by the per group limit, because
> several kinds of kernel threads submit IOs from multiple groups and
> for multiple devices in a single thread. At this time, the kernel
> thread is made to sleep by the per device limit only.
> 

Interesting. Not blocking kernel threads on the per-group limit and
instead blocking them only on the per-device limit sounds like a good idea.

I can do something similar; that will take away the need to export a
per-group congestion interface to higher layers and reduce complexity.
Kernel threads that do not want to block will continue to use the existing
per-device/bdi congestion interface.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                 ` <20090930202447.GA28236-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-10-01  7:33                   ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-01  7:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, 2009-09-30 at 16:24 -0400, Vivek Goyal wrote:
> On Wed, Sep 30, 2009 at 10:05:39PM +0200, Mike Galbraith wrote:
> > 
> >  
> > >  		/*
> > > +		 * We may have seeky queues, don't throttle up just yet.
> > > +		 */
> > > +		if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
> > > +			return 0;
> > > +
> > 
> > bzzzt.  Window too large, but the though is to let them overload, but
> > not instantly.
> > 
> 
> CIC_SEEK_THR is 8K jiffies, so that would be 8 seconds on a 1000 HZ system.
> Try using one "slice_idle" period of 8 ms. But it might turn out to be too
> short depending on the disk speed.

Yeah, it is too short, as is even _400_ ms.  Trouble is, by the time
some new task is determined to be seeky, the damage is already done.

The below does better, though not as well as "just say no to overload"
of course ;-)

I have a patchlet from Corrado to test, likely a better time investment
than poking this darn thing with sharp sticks.

	-Mike

grep elapsed testo.log
    0.894345911  seconds time elapsed <== solo seeky test measurement
* Re: IO scheduler based IO controller V10
  2009-09-30 20:24                 ` Vivek Goyal
  (?)
  (?)
@ 2009-10-01  7:33                 ` Mike Galbraith
       [not found]                   ` <1254382405.7595.9.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-10-02 18:08                   ` Jens Axboe
  -1 siblings, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-01  7:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Wed, 2009-09-30 at 16:24 -0400, Vivek Goyal wrote:
> On Wed, Sep 30, 2009 at 10:05:39PM +0200, Mike Galbraith wrote:
> > 
> >  
> > >  		/*
> > > +		 * We may have seeky queues, don't throttle up just yet.
> > > +		 */
> > > +		if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
> > > +			return 0;
> > > +
> > 
> > bzzzt.  Window too large, but the thought is to let them overload, but
> > not instantly.
> > 
> 
> CIC_SEEK_THR is 8K jiffies, so that would be 8 seconds on a 1000HZ system. Try
> using one "slice_idle" period of 8 ms. But it might turn out to be too
> short depending on the disk speed.

Yeah, it is too short, as is even _400_ ms.  Trouble is, by the time
some new task is determined to be seeky, the damage is already done.

The below does better, though not as well as "just say no to overload"
of course ;-)

I have a patchlet from Corrado to test, likely better time investment
than poking this darn thing with sharp sticks.

	-Mike

grep elapsed testo.log
    0.894345911  seconds time elapsed <== solo seeky test measurement
    3.732472877  seconds time elapsed
    3.208443735  seconds time elapsed
    4.249776673  seconds time elapsed
    2.763449260  seconds time elapsed
    4.235271019  seconds time elapsed

(3.73 + 3.20 + 4.24 + 2.76 + 4.23) / 5 / 0.89 = 4... darn.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e2a9b92..44a888d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -174,6 +174,8 @@ struct cfq_data {
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 
+	unsigned long od_stamp;
+
 	struct list_head cic_list;
 
 	/*
@@ -1296,19 +1298,26 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Drain async requests before we start sync IO
 	 */
-	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+		cfqd->od_stamp = jiffies;
 		return 0;
+	}
 
 	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
-	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+		cfqd->od_stamp = jiffies;
 		return 0;
+	}
 
 	max_dispatch = cfqd->cfq_quantum;
 	if (cfq_class_idle(cfqq))
 		max_dispatch = 1;
 
+	if (cfqd->busy_queues > 1)
+		cfqd->od_stamp = jiffies;
+
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
@@ -1326,6 +1335,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 			return 0;
 
 		/*
+		 * Don't start overloading until we've been alone for a bit.
+		 */
+		if (time_before(jiffies, cfqd->od_stamp + cfq_slice_sync))
+			return 0;
+
+		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
 		if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1941,7 +1956,7 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct cfq_io_context *cic)
 {
-	int old_idle, enable_idle;
+	int old_idle, enable_idle, seeky = 0;
 
 	/*
 	 * Don't idle for async or idle io prio class
@@ -1949,10 +1964,19 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
+	if (cfqd->hw_tag) {
+		if (CIC_SEEKY(cic))
+			seeky = 1;
+		/*
+		 * If known or incalculable seekiness, delay.
+		 */
+		if (seeky || !sample_valid(cic->seek_samples))
+			cfqd->od_stamp = jiffies;
+	}
+
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
+	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || seeky)
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2482,6 +2506,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->hw_tag = 1;
+	cfqd->od_stamp = INITIAL_JIFFIES;
 
 	return cfqd;
 }





^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-30 11:05             ` Vivek Goyal
@ 2009-10-01  6:41               ` Ryo Tsuruta
  -1 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-01  6:41 UTC (permalink / raw)
  To: vgoyal
  Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah,
	lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> > 
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > I was thinking that elevator layer will do the merge of bios. So IO
> > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > into the disk and again timestamp with finish time once request finishes.
> > > 
> > > This way higher layer can get an idea how much disk time a group of bios
> > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > then time accounting becomes an issue.
> > > 
> > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > requests finish at time t4, t5, t6 and t7. For the sake of simplicity, assume
> > > the time elapsed between each of the milestones is t. Also assume that all these
> > > requests are from same queue/group.
> > > 
> > >         t0   t1   t2   t3  t4   t5   t6   t7
> > >         rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
> > > 
> > > Now higher layer will think that time consumed by group is:
> > > 
> > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > 
> > > But the time elapsed is only 7t.
> > 
> > IO controller can know how many requests are issued and still in
> > progress. Is it not enough to accumulate the time while in-flight IOs
> > exist?
> > 
> 
> That time would not reflect disk time used. It will be the following.
> 
> (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> (time spent in disk)

In the case where multiple IO requests are issued from the IO controller,
that time measurement is the time from when the first IO request is
issued until endio is called for the last IO request. Does it not
reflect disk time?
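The 16t-versus-7t discrepancy in the quoted example can be sketched quickly. This is an illustrative Python model, not kernel code: per-request accounting sums each request's dispatch-to-completion interval (double-counting overlap), while union accounting merges the in-flight intervals first.

```python
def naive_disk_time(intervals):
    # Per-request accounting: sum (finish - dispatch) for every request.
    # Concurrently in-flight requests are double-counted, which is how
    # the higher layer arrives at 16t in the example above.
    return sum(end - start for start, end in intervals)

def merged_disk_time(intervals):
    # Union accounting: merge overlapping intervals and sum only the
    # union, so concurrent requests are counted once (7t above).
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

# rq1..rq4 dispatched at t0..t3, completing at t4..t7 (units of t)
requests = [(0, 4), (1, 5), (2, 6), (3, 7)]
```

With these four requests, naive accounting gives 16 while the merged union gives 7, matching the figures in the quoted mail.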

> > > Secondly if a different group is running only single sequential reader,
> > > there CFQ will be driving queue depth of 1 and time will not be running
> > > faster and this inaccuracy in accounting will lead to unfair share between
> > > groups.
> > >
> > > So we need something better to get a sense which group used how much of
> > > disk time.
> > 
> > It could be solved by implementing the way to pass on such information
> > from IO scheduler to higher layer controller.
> > 
> 
> How would you do that? Can you give some details exactly how and what
> information IO scheduler will pass to higher level IO controller so that IO
> controller can attribute right time to the group.

If you would like to know when the idle timer expires, how about
adding a function to the IO controller so that it can be notified by
the IO scheduler? The IO scheduler calls the function when the timer
expires.

> > > > How about making throttling policy be user selectable like the IO
> > > > scheduler and putting it in the higher layer? So we could support
> > > > all of policies (time-based, size-based and rate limiting). There
> > > > seems not to only one solution which satisfies all users. But I agree
> > > > with starting with proportional bandwidth control first. 
> > > > 
> > > 
> > > What are the cases where time based policy does not work and size based
> > > policy works better and user would choose size based policy and not timed
> > > based one?
> > 
> > I think that disk time is not simply proportional to IO size. If there
> > are two groups whose weights are equally assigned and they issue
> > different sized IOs respectively, the bandwidth of each group would
> > not be distributed equally as expected.
> > 
> 
> If we are providing fairness in terms of time, it is fair. If we provide
> equal time slots to two processes and if one got more IO done because it
> was not wasting time seeking or it issued bigger size IO, it deserves that
> higher BW. IO controller will make sure that process gets fair share in
> terms of time and exactly how much BW one got will depend on the workload.
> 
> That's the precise reason that fairness in terms of time is better on
> seeky media.

If the seek time is negligible, the bandwidth would not be distributed
in proportion to the weight settings. I think that it would be unclear
to users how bandwidth is distributed. And I also think that seeky
media will gradually become obsolete.
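Vivek's point about time-based fairness can be made concrete with a toy calculation (the throughput figures are hypothetical, chosen only to illustrate the effect): under equal time slices, the bandwidth each group sees is its time share times whatever throughput it achieves while it owns the disk.

```python
def bandwidth_share(time_share, throughput_mb_s):
    # Under time-based fairness, a group's bandwidth is its fraction of
    # disk time multiplied by the throughput it sustains during that time.
    return time_share * throughput_mb_s

# Two equal-weight groups each get a 50% time share; the sequential
# reader still ends up with far more bandwidth than the seeky one.
seq_bw = bandwidth_share(0.5, 100.0)   # hypothetical 100 MB/s streaming
seeky_bw = bandwidth_share(0.5, 5.0)   # hypothetical 5 MB/s seeky IO
```

So equal time shares are fair in Vivek's sense, yet the resulting bandwidth split (here 50 vs 2.5 MB/s) is exactly what Ryo argues users may find unintuitive.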

> > > I am not against implementing things in higher layer as long as we can
> > > ensure tight control on latencies, strong isolation between groups and
> > > not break CFQ's class and ioprio model with-in group.
> > > 
> > > > BTW, I will start to reimplement dm-ioband into block layer.
> > > 
> > > Can you elaborate little bit on this?
> > 
> > bio is grabbed in generic_make_request() and throttled in the same way as in
> > dm-ioband's mechanism. The dmsetup command is not necessary any longer.
> > 
> 
> Ok, so one would not need the dm-ioband device now, but the same dm-ioband
> throttling policies will apply. So until and unless we figure out a
> better way, the issues I have pointed out will still exist even in the
> new implementation.

Yes, those still exist, but somehow I would like to try to solve them.

> > The default value of io_limit on the previous test was 128 (not 192)
> > which is equall to the default value of nr_request.
> 
> Hm..., I used following commands to create two ioband devices.
> 
> echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> "weight 0 :100" | dmsetup create ioband1
> echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> "weight 0 :100" | dmsetup create ioband2
> 
> Here io_limit value is zero so it should pick default value. Following is
> output of "dmsetup table" command.
> 
> ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
>                                     ^^^^
> IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> to be 192?

The default value changed in v1.12.0, increasing from 128 to 192.

> > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > writes.
> > 
> > O.K. You removed "conv=fdatasync", the new dm-ioband handles
> > sync/async requests separately, and it solves this
> > buffered-write-starves-read problem. I would like to post it soon
> > after doing some more test.
> > 
> > > On top of that can you please give some details how increasing the
> > > buffered queue length reduces the impact of writers?
> > 
> > When the number of in-flight IOs exceeds io_limit, processes which are
> > going to issue IOs are made to sleep by dm-ioband until all the in-flight
> > IOs are finished. But the IO scheduler layer can accept more IO requests
> > than the value of io_limit, so it was a throughput bottleneck.
> > 
> 
> Ok, so it should have been throughput bottleneck but how did it solve the
> issue of writer starving the reader as you had mentioned in the mail.

As I wrote above, I modified dm-ioband to handle sync/async requests
separately, so even if writers do a lot of buffered IOs, readers can
issue IOs regardless of the writers' busyness. Once IOs are backlogged
for throttling, both sync and async requests are issued according to
their order of arrival.

> Secondly, you mentioned that processes are made to sleep once we cross 
> io_limit. This sounds like the request descriptor facility on the request queue
> where processes are made to sleep.
>
> There are threads in kernel which don't want to sleep while submitting
> bios. For example, btrfs has bio submitting thread which does not want
> to sleep hence it checks with device if it is congested or not and not
> submit the bio if it is congested.  How would you handle such cases. Have
> you implemented any per group congestion kind of interface to make sure
> such IO's don't sleep if group is congested.
>
> Or is this limit per ioband device, which every group on the device is
> sharing? If yes, then how would you provide isolation between groups,
> because if one group consumes the io_limit tokens, the others will simply
> be serialized on that device?

There are two kinds of limits, and both cap the number of IO requests
which can be issued simultaneously: one is per ioband device, the
other is per ioband group. The per-group limit assigned to each group
is calculated by dividing io_limit among the groups in proportion to
their weights.
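The weight-proportional split described above can be sketched as follows. This is a minimal model of the idea, not dm-ioband's actual code; in particular the integer-floor rounding is an assumption, and the real implementation may distribute the remainder differently.

```python
def per_group_limits(io_limit, weights):
    # Split the device-wide io_limit among groups in proportion to
    # their weights (hypothetical rounding: integer floor).
    total_weight = sum(weights.values())
    return {group: io_limit * w // total_weight
            for group, w in weights.items()}

# Two equal-weight groups on a device with the default io_limit of 192
# (the value shown in the dmsetup output earlier in the thread).
limits = per_group_limits(192, {"ioband1": 100, "ioband2": 100})
```

Here each group would get 96 of the 192 device-wide tokens; unequal weights would skew the split accordingly.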

Kernel threads are not made to sleep by the per-group limit, because
several kinds of kernel threads submit IOs for multiple groups and
multiple devices from a single thread. At this time, kernel threads
are made to sleep only by the per-device limit.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-10-01  6:41               ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-01  6:41 UTC (permalink / raw)
  To: vgoyal
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew,
	yoshikawa.takuya, jmoyer, nauman, mingo, m-ikeda, riel, lizf,
	fchecconi, s-uchida, containers, linux-kernel, akpm,
	righi.andrea, torvalds

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> > 
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > I was thinking that elevator layer will do the merge of bios. So IO
> > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > into the disk and again timestamp with finish time once request finishes.
> > > 
> > > This way higher layer can get an idea how much disk time a group of bios
> > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > then time accounting becomes an issue.
> > > 
> > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume
> > > time elapsed between each of milestones is t. Also assume that all these
> > > requests are from same queue/group.
> > > 
> > >         t0   t1   t2   t3  t4   t5   t6   t7
> > >         rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
> > > 
> > > Now higher layer will think that time consumed by group is:
> > > 
> > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > 
> > > But the time elapsed is only 7t.
> > 
> > IO controller can know how many requests are issued and still in
> > progress. Is it not enough to accumulate the time while in-flight IOs
> > exist?
> > 
> 
> That time would not reflect disk time used. It will be follwoing.
> 
> (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> (time spent in disk)

In the case where multiple IO requests are issued from IO controller,
that time measurement is the time from when the first IO request is
issued until when the endio is called for the last IO request. Does
not it reflect disk time?

> > > Secondly if a different group is running only single sequential reader,
> > > there CFQ will be driving queue depth of 1 and time will not be running
> > > faster and this inaccuracy in accounting will lead to unfair share between
> > > groups.
> > >
> > > So we need something better to get a sense which group used how much of
> > > disk time.
> > 
> > It could be solved by implementing the way to pass on such information
> > from IO scheduler to higher layer controller.
> > 
> 
> How would you do that? Can you give some details exactly how and what
> information IO scheduler will pass to higher level IO controller so that IO
> controller can attribute right time to the group.

If you would like to know when the idle timer is expired, how about
adding a function to IO controller to be notified it from IO
scheduler? IO scheduler calls the function when the timer is expired.

> > > > How about making throttling policy be user selectable like the IO
> > > > scheduler and putting it in the higher layer? So we could support
> > > > all of policies (time-based, size-based and rate limiting). There
> > > > seems not to only one solution which satisfies all users. But I agree
> > > > with starting with proportional bandwidth control first. 
> > > > 
> > > 
> > > What are the cases where time based policy does not work and size based
> > > policy works better and user would choose size based policy and not timed
> > > based one?
> > 
> > I think that disk time is not simply proportional to IO size. If there
> > are two groups whose wights are equally assigned and they issue
> > different sized IOs repsectively, the bandwidth of each group would
> > not distributed equally as expected. 
> > 
> 
> If we are providing fairness in terms of time, it is fair. If we provide
> equal time slots to two processes and if one got more IO done because it
> was not wasting time seeking or it issued bigger size IO, it deserves that
> higher BW. IO controller will make sure that process gets fair share in
> terms of time and exactly how much BW one got will depend on the workload.
> 
> That's the precise reason that fairness in terms of time is better on
> seeky media.

If the seek time is negligible, the bandwidth would not be distributed 
according to a proportion of weight settings. I think that it would be
unclear for users to understand how bandwidth is distributed. And I
also think that seeky media would gradually become obsolete,

> > > I am not against implementing things in higher layer as long as we can
> > > ensure tight control on latencies, strong isolation between groups and
> > > not break CFQ's class and ioprio model with-in group.
> > > 
> > > > BTW, I will start to reimplement dm-ioband into block layer.
> > > 
> > > Can you elaborate little bit on this?
> > 
> > bio is grabbed in generic_make_request() and throttled as well as
> > dm-ioband's mechanism. dmsetup command is not necessary any longer.
> > 
> 
> Ok, so one would not need dm-ioband device now, but same dm-ioband
> throttling policies will apply. So until and unless we figure out a
> better way, the issues I have pointed out will still exists even in
> new implementation.

Yes, those still exist, but somehow I would like to try to solve them.

> > The default value of io_limit on the previous test was 128 (not 192)
> > which is equall to the default value of nr_request.
> 
> Hm..., I used following commands to create two ioband devices.
> 
> echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> "weight 0 :100" | dmsetup create ioband1
> echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> "weight 0 :100" | dmsetup create ioband2
> 
> Here io_limit value is zero so it should pick default value. Following is
> output of "dmsetup table" command.
> 
> ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
>                                     ^^^^
> IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> to be 192?

The default vaule has changed since v1.12.0 and increased from 128 to 192.

> > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > writes.
> > 
> > O.K. You removed "conv=fdatasync", the new dm-ioband handles
> > sync/async requests separately, and it solves this
> > buffered-write-starves-read problem. I would like to post it soon
> > after doing some more test.
> > 
> > > On top of that can you please give some details how increasing the
> > > buffered queue length reduces the impact of writers?
> > 
> > When the number of in-flight IOs exceeds io_limit, processes which are
> > going to issue IOs are made to sleep by dm-ioband until all the in-flight
> > IOs are finished. But the IO scheduler layer can accept more IO requests
> > than the value of io_limit, so it was a throughput bottleneck.
> > 
> 
> Ok, so it should have been a throughput bottleneck, but how did it solve the
> issue of the writer starving the reader, as you had mentioned in the mail?

As written above, I modified dm-ioband to handle sync/async requests
separately, so even if writers do a lot of buffered IOs, readers can
issue IOs regardless of the writers' busyness. Once IOs are backlogged
for throttling, both sync and async requests are issued according
to their order of arrival.

> Secondly, you mentioned that processes are made to sleep once we cross 
> io_limit. This sounds like the request descriptor facility on the request queue
> where processes are made to sleep.
>
> There are threads in the kernel which don't want to sleep while submitting
> bios. For example, btrfs has a bio-submitting thread which does not want
> to sleep; it checks whether the device is congested and does not submit
> the bio if it is. How would you handle such cases? Have you implemented
> any per-group congestion interface to make sure such IOs don't sleep
> if the group is congested.
>
> Or is this limit per ioband device, shared by every group on the device?
> If yes, then how would you provide isolation between groups? If one
> group consumes the io_limit tokens, the others will simply be
> serialized on that device.

There are two kinds of limits, and both limit the number of IO requests
which can be issued simultaneously: one is per ioband device,
the other is per ioband group. The per-group limit assigned to
each group is calculated by dividing io_limit according to the
group's proportion of the total weight.

Kernel threads are not made to sleep by the per-group limit, because
several kinds of kernel threads submit IOs for multiple groups and
multiple devices from a single thread. At present, kernel threads
are made to sleep only by the per-device limit.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]               ` <1254341139.7695.36.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-09-30 20:24                 ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-30 20:24 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Sep 30, 2009 at 10:05:39PM +0200, Mike Galbraith wrote:
> 
>  
> >  		/*
> > +		 * We may have seeky queues, don't throttle up just yet.
> > +		 */
> > +		if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
> > +			return 0;
> > +
> 
> bzzzt.  Window too large, but the thought is to let them overload, just
> not instantly.
> 

CIC_SEEK_THR is 8K jiffies, so that would be 8 seconds on a 1000HZ system. Try
using one "slice_idle" period of 8 ms instead. But it might turn out to be too
short depending on the disk speed.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]             ` <1254340730.7695.32.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-09-30 20:05               ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-30 20:05 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


 
>  		/*
> +		 * We may have seeky queues, don't throttle up just yet.
> +		 */
> +		if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
> +			return 0;
> +

bzzzt.  Window too large, but the thought is to let them overload, just
not instantly.

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]           ` <20090927164235.GA23126-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2009-09-27 18:15             ` Mike Galbraith
@ 2009-09-30 19:58             ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-30 19:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:

> It's a given that not merging will provide better latency. We can't
> disable that or performance will suffer A LOT on some systems. There are
> ways to make it better, though. One would be to make the max request
> size smaller, but that would also hurt for streamed workloads. Can you
> try whether the below patch makes a difference? It will basically
> disallow merges to a request that isn't the last one.

Thoughts about something like the below?

The problem with the dd vs konsole -e exit type load seems to be
kjournald overloading the disk between reads.  When userland is blocked,
kjournald is free to stuff 4*quantum into the queue instantly.

Taking the hint from Vivek's fairness tweakable patch, I stamped the
queue when a seeker was last seen, and disallowed overload within
CIC_SEEK_THR of that time.  Worked well.

dd competing against perf stat -- konsole -e exec timings, 5 back to back runs
                                                           Avg
before         9.15    14.51     9.39    15.06     9.90   11.6
after          1.76     1.54     1.93     1.88     1.56    1.7

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e2a9b92..4a00129 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -174,6 +174,8 @@ struct cfq_data {
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 
+	unsigned long last_seeker;
+
 	struct list_head cic_list;
 
 	/*
@@ -1326,6 +1328,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 			return 0;
 
 		/*
+		 * We may have seeky queues, don't throttle up just yet.
+		 */
+		if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
+			return 0;
+
+		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
 		if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1941,7 +1949,7 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct cfq_io_context *cic)
 {
-	int old_idle, enable_idle;
+	int old_idle, enable_idle, seeky = 0;
 
 	/*
 	 * Don't idle for async or idle io prio class
@@ -1951,8 +1959,12 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
+	if (cfqd->hw_tag && CIC_SEEKY(cic)) {
+		cfqd->last_seeker = jiffies;
+		seeky = 1;
+	}
+
+	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || seeky)
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2482,6 +2494,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->hw_tag = 1;
+	cfqd->last_seeker = jiffies;
 
 	return cfqd;
 }

^ permalink raw reply related	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]           ` <20090930.174319.183036386.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-30 11:05             ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-30 11:05 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > I was thinking that elevator layer will do the merge of bios. So IO
> > scheduler/elevator can time stamp the first bio in the request as it goes
> > into the disk and again timestamp with finish time once request finishes.
> > 
> > This way higher layer can get an idea how much disk time a group of bios
> > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > then time accounting becomes an issue.
> > 
> > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > requests finish at time t4, t5, t6 and t7. For the sake of simplicity assume
> > time elapsed between each of milestones is t. Also assume that all these
> > requests are from same queue/group.
> > 
> >         t0   t1   t2   t3  t4   t5   t6   t7
> >         rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
> > 
> > Now higher layer will think that time consumed by group is:
> > 
> > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > 
> > But the time elapsed is only 7t.
> 
> IO controller can know how many requests are issued and still in
> progress. Is it not enough to accumulate the time while in-flight IOs
> exist?
> 

That time would not reflect the disk time used. It will be the following:

(time spent waiting in CFQ queues) + (time spent in dispatch queue) +
(time spent in disk)

> > Secondly if a different group is running only single sequential reader,
> > there CFQ will be driving queue depth of 1 and time will not be running
> > faster and this inaccuracy in accounting will lead to unfair share between
> > groups.
> >
> > So we need something better to get a sense which group used how much of
> > disk time.
> 
> It could be solved by implementing the way to pass on such information
> from IO scheduler to higher layer controller.
> 

How would you do that? Can you give some details on exactly how and what
information the IO scheduler will pass to the higher-level IO controller so
that it can attribute the right time to the group?

> > > How about making throttling policy be user selectable like the IO
> > > scheduler and putting it in the higher layer? So we could support
> > > all of policies (time-based, size-based and rate limiting). There
> > > seems not to only one solution which satisfies all users. But I agree
> > > with starting with proportional bandwidth control first. 
> > > 
> > 
> > What are the cases where time based policy does not work and size based
> > policy works better and user would choose size based policy and not timed
> > based one?
> 
> I think that disk time is not simply proportional to IO size. If there
> are two groups whose weights are equally assigned and they issue
> different-sized IOs respectively, the bandwidth of each group would
> not be distributed equally as expected.
> 

If we are providing fairness in terms of time, it is fair. If we provide
equal time slots to two processes and one gets more IO done because it
was not wasting time seeking, or because it issued bigger IOs, it deserves
that higher BW. The IO controller will make sure that each process gets a
fair share in terms of time; exactly how much BW one gets depends on the workload.

That's the precise reason that fairness in terms of time is better on
seeky media.

> > I am not against implementing things in higher layer as long as we can
> > ensure tight control on latencies, strong isolation between groups and
> > not break CFQ's class and ioprio model with-in group.
> > 
> > > BTW, I will start to reimplement dm-ioband into block layer.
> > 
> > Can you elaborate little bit on this?
> 
> bio is grabbed in generic_make_request() and throttled as well as
> dm-ioband's mechanism. dmsetup command is not necessary any longer.
> 

Ok, so one would not need a dm-ioband device now, but the same dm-ioband
throttling policies will apply. So unless and until we figure out a
better way, the issues I have pointed out will still exist even in the
new implementation.

> > > > Fairness for higher level logical devices
> > > > =========================================
> > > > Do we want good fairness numbers for higher level logical devices also
> > > > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > > > at leaf nodes can help us use the resources optimally and in the process
> > > > we can get fairness at higher level also in many of the cases.
> > > 
> > > We should also take care of block devices which provide their own
> > > make_request_fn() and not use a IO scheduler. We can't use the leaf
> > > nodes approach to such devices.
> > > 
> > 
> > I am not sure how big an issue this is. This can be easily solved by
> > making use of NOOP scheduler by these devices. What are the reasons for
> > these devices to not use even noop? 
> 
> I'm not sure why the developers of the device driver choose their own
> way, and the driver is provided in binary form, so we can't modify it.
> 
> > > > Fairness with-in group
> > > > ======================
> > > > One of the issues with higher level controller is that how to do fair
> > > > throttling so that fairness with-in group is not impacted. Especially
> > > > the case of making sure that we don't break the notion of ioprio of the
> > > > processes with-in group.
> > > 
> > > I ran your test script to confirm that the notion of ioprio was not
> > > broken by dm-ioband. Here is the results of the test.
> > > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> > > 
> > > I think that the time period during which dm-ioband holds IO requests
> > > for throttling would be too short to break the notion of ioprio.
> > 
> > Ok, I re-ran that test. Previously default io_limit value was 192 and now
> 
> The default value of io_limit on the previous test was 128 (not 192)
> which is equal to the default value of nr_requests.

Hm..., I used following commands to create two ioband devices.

echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
"weight 0 :100" | dmsetup create ioband1
echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
"weight 0 :100" | dmsetup create ioband2

Here io_limit value is zero so it should pick default value. Following is
output of "dmsetup table" command.

ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
                                    ^^^^
IIUC, above number 192 is reflecting io_limit? If yes, then default seems
to be 192?

> 
> > I set it up to 256 as you suggested. I still see writer starving reader. I
> > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > writes.
> 
> O.K. You removed "conv=fdatasync", the new dm-ioband handles
> sync/async requests separately, and it solves this
> buffered-write-starves-read problem. I would like to post it soon
> after doing some more test.
> 
> > On top of that can you please give some details how increasing the
> > buffered queue length reduces the impact of writers?
> 
> When the number of in-flight IOs exceeds io_limit, processes which are
> going to issue IOs are made to sleep by dm-ioband until all the in-flight
> IOs are finished. But the IO scheduler layer can accept more IO requests
> than the value of io_limit, so it was a throughput bottleneck.
> 

Ok, so it should have been a throughput bottleneck, but how did it solve the
issue of the writer starving the reader, as you had mentioned in the mail?

Secondly, you mentioned that processes are made to sleep once we cross 
io_limit. This sounds like the request descriptor facility on the request queue
where processes are made to sleep.

There are threads in the kernel which don't want to sleep while submitting
bios. For example, btrfs has a bio-submitting thread which does not want
to sleep; it checks whether the device is congested and does not submit
the bio if it is. How would you handle such cases? Have you implemented
any per-group congestion interface to make sure such IOs don't sleep
if the group is congested.

Or is this limit per ioband device, shared by every group on the device?
If yes, then how would you provide isolation between groups? If one
group consumes the io_limit tokens, the others will simply be
serialized on that device.

> > IO Prio issue
> > --------------
> > I ran another test where two ioband devices were created of weight 100 
> > each on two partitions. In first group 4 readers were launched. Three
> > readers are of class BE and prio 7, fourth one is of class BE prio 0. In
> > group2, I launched a buffered writer.
> > 
> > One would expect that prio0 reader gets more bandwidth as compared to
> > prio 4 readers and prio 7 readers will get more or less same bw. Looks like
> > that is not happening. Look how vanilla CFQ provides much more bandwidth
> > to prio0 reader as compared to prio7 reader and how putting them in the
> > group reduces the difference betweej prio0 and prio7 readers.
> > 
> > Following are the results.
> 
> O.K. I'll try to do more test with dm-ioband according to your
> comments especially working with CFQ. Thanks for pointing out.
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-30  8:43         ` Ryo Tsuruta
@ 2009-09-30 11:05             ` Vivek Goyal
       [not found]           ` <20090930.174319.183036386.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-30 11:05 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah,
	lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > I was thinking that elevator layer will do the merge of bios. So IO
> > scheduler/elevator can time stamp the first bio in the request as it goes
> > into the disk and again timestamp with finish time once request finishes.
> > 
> > This way higher layer can get an idea how much disk time a group of bios
> > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > then time accounting becomes an issue.
> > 
> > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume
> > time elapsed between each of milestones is t. Also assume that all these
> > requests are from same queue/group.
> > 
> >         t0   t1   t2   t3  t4   t5   t6   t7
> >         rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
> > 
> > Now higher layer will think that time consumed by group is:
> > 
> > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > 
> > But the time elapsed is only 7t.
> 
> IO controller can know how many requests are issued and still in
> progress. Is it not enough to accumulate the time while in-flight IOs
> exist?
> 

That time would not reflect disk time used. It will be follwoing.

(time spent waiting in CFQ queues) + (time spent in dispatch queue) +
(time spent in disk)

> > Secondly if a different group is running only single sequential reader,
> > there CFQ will be driving queue depth of 1 and time will not be running
> > faster and this inaccuracy in accounting will lead to unfair share between
> > groups.
> >
> > So we need something better to get a sense which group used how much of
> > disk time.
> 
> It could be solved by implementing the way to pass on such information
> from IO scheduler to higher layer controller.
> 

How would you do that? Can you give some details exactly how and what
information IO scheduler will pass to higher level IO controller so that IO
controller can attribute right time to the group.

> > > How about making throttling policy be user selectable like the IO
> > > scheduler and putting it in the higher layer? So we could support
> > > all of policies (time-based, size-based and rate limiting). There
> > > seems not to only one solution which satisfies all users. But I agree
> > > with starting with proportional bandwidth control first. 
> > > 
> > 
> > What are the cases where time based policy does not work and size based
> > policy works better and user would choose size based policy and not timed
> > based one?
> 
> I think that disk time is not simply proportional to IO size. If there
> are two groups whose wights are equally assigned and they issue
> different sized IOs repsectively, the bandwidth of each group would
> not distributed equally as expected. 
> 

If we are providing fairness in terms of time, it is fair. If we provide
equal time slots to two processes and if one got more IO done because it
was not wasting time seeking or it issued bigger size IO, it deserves that
higher BW. IO controller will make sure that process gets fair share in
terms of time and exactly how much BW one got will depend on the workload.

That's the precise reason that fairness in terms of time is better on
seeky media.

> > I am not against implementing things in higher layer as long as we can
> > ensure tight control on latencies, strong isolation between groups and
> > not break CFQ's class and ioprio model with-in group.
> > 
> > > BTW, I will start to reimplement dm-ioband into block layer.
> > 
> > Can you elaborate little bit on this?
> 
> bio is grabbed in generic_make_request() and throttled as well as
> dm-ioband's mechanism. dmsetup command is not necessary any longer.
> 

Ok, so one would not need dm-ioband device now, but same dm-ioband
throttling policies will apply. So until and unless we figure out a
better way, the issues I have pointed out will still exists even in
new implementation.

> > > > Fairness for higher level logical devices
> > > > =========================================
> > > > Do we want good fairness numbers for higher level logical devices also
> > > > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > > > at leaf nodes can help us use the resources optimally and in the process
> > > > we can get fairness at higher level also in many of the cases.
> > > 
> > > We should also take care of block devices which provide their own
> > > make_request_fn() and not use a IO scheduler. We can't use the leaf
> > > nodes approach to such devices.
> > > 
> > 
> > I am not sure how big an issue is this. This can be easily solved by
> > making use of NOOP scheduler by these devices. What are the reasons for
> > these devices to not use even noop? 
> 
> I'm not sure why the developers of the device driver choose their own
> way, and the driver is provided in binary form, so we can't modify it.
> 
> > > > Fairness within a group
> > > > =======================
> > > > One of the issues with a higher-level controller is how to do fair
> > > > throttling so that fairness within a group is not impacted, especially
> > > > making sure that we don't break the notion of ioprio of the
> > > > processes within the group.
> > > 
> > > I ran your test script to confirm that the notion of ioprio was not
> > > broken by dm-ioband. Here are the results of the test.
> > > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> > > 
> > > I think that the time period during which dm-ioband holds IO requests
> > > for throttling would be too short to break the notion of ioprio.
> > 
> > Ok, I re-ran that test. Previously the default io_limit value was 192 and now
> 
> The default value of io_limit in the previous test was 128 (not 192),
> which is equal to the default value of nr_requests.

Hm..., I used the following commands to create two ioband devices.

echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none" \
     "weight 0 :100" | dmsetup create ioband1
echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none" \
     "weight 0 :100" | dmsetup create ioband2

Here the io_limit value is zero, so it should pick the default value. Following
is the output of the "dmsetup table" command.

ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
                                    ^^^^
IIUC, the number 192 above reflects io_limit? If yes, then the default seems
to be 192?

> 
> > I set it up to 256 as you suggested. I still see the writer starving the
> > reader. I have removed "conv=fdatasync" from the writer so that the writer
> > does pure buffered writes.
> 
> O.K. You removed "conv=fdatasync"; the new dm-ioband handles
> sync/async requests separately, and it solves this
> buffered-write-starves-read problem. I would like to post it soon
> after doing some more tests.
> 
> > On top of that, can you please give some details on how increasing the
> > buffered queue length reduces the impact of writers?
> 
> When the number of in-flight IOs exceeds io_limit, processes which are
> going to issue IOs are made to sleep by dm-ioband until all the in-flight
> IOs are finished. But the IO scheduler layer can accept more IO requests
> than the value of io_limit, so it was a throughput bottleneck.
> 

Ok, so it should have been a throughput bottleneck, but how did it solve the
issue of the writer starving the reader, as you had mentioned in the mail?

Secondly, you mentioned that processes are made to sleep once we cross
io_limit. This sounds like the request descriptor facility on the request
queue, where processes are made to sleep.

There are threads in the kernel which don't want to sleep while submitting
bios. For example, btrfs has a bio-submitting thread which does not want
to sleep; hence it checks with the device whether it is congested, and does
not submit the bio if it is. How would you handle such cases? Have
you implemented any per-group congestion interface to make sure
such IOs don't sleep if the group is congested?

Or is this limit per ioband device, shared by every group on the device?
If yes, then how would you provide isolation between groups? If one group
consumes the io_limit tokens, the others will simply be serialized on that
device.

> > IO Prio issue
> > --------------
> > I ran another test where two ioband devices of weight 100 each were
> > created on two partitions. In the first group, 4 readers were launched.
> > Three readers are of class BE and prio 7; the fourth one is of class BE
> > prio 0. In group 2, I launched a buffered writer.
> > 
> > One would expect that the prio 0 reader gets more bandwidth than the
> > prio 7 readers, and that the prio 7 readers get more or less the same bw.
> > Looks like that is not happening. Look how vanilla CFQ provides much more
> > bandwidth to the prio 0 reader compared to a prio 7 reader, and how putting
> > them in groups reduces the difference between the prio 0 and prio 7 readers.
> > 
> > Following are the results.
> 
> O.K. I'll try to do more tests with dm-ioband according to your
> comments, especially working with CFQ. Thanks for pointing this out.
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]         ` <20090929141049.GA12141-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-29 19:53           ` Nauman Rafique
@ 2009-09-30  8:43           ` Ryo Tsuruta
  1 sibling, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-09-30  8:43 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,

Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> I was thinking that elevator layer will do the merge of bios. So IO
> scheduler/elevator can time stamp the first bio in the request as it goes
> into the disk and again timestamp with finish time once request finishes.
> 
> This way higher layer can get an idea how much disk time a group of bios
> used. But on multi queue, if we dispatch say 4 requests from same queue,
> then time accounting becomes an issue.
> 
> Consider following where four requests rq1, rq2, rq3 and rq4 are
> dispatched to disk at time t0, t1, t2 and t3 respectively and these
> requests finish at time t4, t5, t6 and t7. For the sake of simplicity, assume
> the time elapsed between each of the milestones is t. Also assume that all
> these requests are from the same queue/group.
> 
>         t0   t1   t2   t3  t4   t5   t6   t7
>         rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
> 
> Now higher layer will think that time consumed by group is:
> 
> (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> 
> But the time elapsed is only 7t.

The IO controller can know how many requests have been issued and are still
in progress. Is it not enough to accumulate the time while in-flight IOs
exist?

> Secondly, if a different group is running only a single sequential reader,
> then CFQ will be driving a queue depth of 1, time will not be running
> faster, and this inaccuracy in accounting will lead to unfair shares
> between groups.
>
> So we need something better to get a sense of which group used how much
> disk time.

It could be solved by implementing a way to pass such information
from the IO scheduler to the higher-layer controller.

> > How about making the throttling policy user selectable like the IO
> > scheduler and putting it in the higher layer? Then we could support
> > all of the policies (time-based, size-based and rate limiting). There
> > seems not to be only one solution which satisfies all users. But I agree
> > with starting with proportional bandwidth control first.
> > 
> 
> What are the cases where a time-based policy does not work and a size-based
> policy works better, such that a user would choose the size-based policy and
> not the time-based one?

I think that disk time is not simply proportional to IO size. If there
are two groups whose weights are equal and they issue
different-sized IOs respectively, the bandwidth of each group would
not be distributed equally as expected.

> I am not against implementing things in a higher layer as long as we can
> ensure tight control on latencies, strong isolation between groups, and
> not break CFQ's class and ioprio model within a group.
> 
> > BTW, I will start to reimplement dm-ioband in the block layer.
> 
> Can you elaborate a little bit on this?

A bio is grabbed in generic_make_request() and throttled in the same way
as dm-ioband's mechanism. The dmsetup command is no longer necessary.

> > > Fairness for higher level logical devices
> > > =========================================
> > > Do we want good fairness numbers for higher-level logical devices also,
> > > or is it sufficient to provide fairness at leaf nodes? Providing fairness
> > > at leaf nodes can help us use the resources optimally, and in the process
> > > we can get fairness at the higher level also in many of the cases.
> > 
> > We should also take care of block devices which provide their own
> > make_request_fn() and do not use an IO scheduler. We can't use the
> > leaf-nodes approach for such devices.
> > 
> 
> I am not sure how big an issue this is. It can easily be solved by
> having these devices use the NOOP scheduler. What are the reasons for
> these devices not to use even noop?

> I'm not sure why the developers of the device driver chose their own
> way, and the driver is provided in binary form, so we can't modify it.

> > > Fairness within a group
> > > =======================
> > > One of the issues with a higher-level controller is how to do fair
> > > throttling so that fairness within a group is not impacted, especially
> > > making sure that we don't break the notion of ioprio of the
> > > processes within the group.
> > 
> > I ran your test script to confirm that the notion of ioprio was not
> > broken by dm-ioband. Here are the results of the test.
> > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> > 
> > I think that the time period during which dm-ioband holds IO requests
> > for throttling would be too short to break the notion of ioprio.
> 
> Ok, I re-ran that test. Previously the default io_limit value was 192 and now

The default value of io_limit in the previous test was 128 (not 192),
which is equal to the default value of nr_requests.

> I set it up to 256 as you suggested. I still see the writer starving the
> reader. I have removed "conv=fdatasync" from the writer so that the writer
> does pure buffered writes.

O.K. You removed "conv=fdatasync"; the new dm-ioband handles
sync/async requests separately, and it solves this
buffered-write-starves-read problem. I would like to post it soon
after doing some more tests.

> On top of that, can you please give some details on how increasing the
> buffered queue length reduces the impact of writers?

When the number of in-flight IOs exceeds io_limit, processes which are
going to issue IOs are made to sleep by dm-ioband until all the in-flight
IOs are finished. But the IO scheduler layer can accept more IO requests
than the value of io_limit, so it was a throughput bottleneck.

> IO Prio issue
> --------------
> I ran another test where two ioband devices of weight 100 each were
> created on two partitions. In the first group, 4 readers were launched.
> Three readers are of class BE and prio 7; the fourth one is of class BE
> prio 0. In group 2, I launched a buffered writer.
> 
> One would expect that the prio 0 reader gets more bandwidth than the
> prio 7 readers, and that the prio 7 readers get more or less the same bw.
> Looks like that is not happening. Look how vanilla CFQ provides much more
> bandwidth to the prio 0 reader compared to a prio 7 reader, and how putting
> them in groups reduces the difference between the prio 0 and prio 7 readers.
> 
> Following are the results.

O.K. I'll try to do more tests with dm-ioband according to your
comments, especially working with CFQ. Thanks for pointing this out.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]       ` <20090929.185653.183056711.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2009-09-29 10:49         ` Takuya Yoshikawa
  2009-09-29 14:10         ` Vivek Goyal
@ 2009-09-30  3:11         ` Vivek Goyal
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-30  3:11 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and all,
> 
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> 
> > > We are starting from a point where there is no cgroup based IO
> > > scheduling in the kernel. And it is probably not reasonable to satisfy
> > > all IO scheduling related requirements in one patch set. We can start
> > > with something simple, and build on top of that. So a very simple
> > > patch set that enables cgroup based proportional scheduling for CFQ
> > > seems like the way to go at this point.
> > 
> > Sure, we can start with CFQ only. But a bigger question we need to answer
> > is whether CFQ is the right place to solve the issue. Jens, do you think
> > that CFQ is the right place to solve the problem?
> > 
> > Andrew seems to favor a high level approach so that IO schedulers are less
> > complex and we can provide fairness at high level logical devices also. 
> 
> I'm not in favor of expanding CFQ, because some enterprise storage
> performs better with NOOP than with CFQ, and I think bandwidth
> control is needed much more for such storage systems. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specifically about Nauman's scheduler design.
> 
> > I will again try to summarize my understanding so far about the pros/cons
> > of each approach and then we can take the discussion forward.
> 
> Good summary. Thanks for your work.
> 
> > Fairness in terms of size of IO or disk time used
> > =================================================
> > On seeky media, fairness in terms of disk time can get us better results
> > than fairness in terms of IO size or number of IOs.
> > 
> > If we implement some kind of time-based solution at a higher layer, then
> > that higher layer should know how much time each group used. We
> > can probably do some kind of timestamping in the bio to get a sense of when
> > it got into the disk and when it finished. But on multi-queue hardware there
> > can be multiple requests in the disk, either from the same queue or from
> > different queues, and with a pure timestamping-based approach, so far I could
> > not think of how, at a high level, we will get an idea of who used how much time.
> 
> IIUC, could the overlap time be calculated from the timestamps on
> multi-queue hardware?
>  
> > So this is the first point of contention: how do we want to provide
> > fairness? In terms of disk time used, or in terms of IO size / number
> > of IOs?
> >
> > Max bandwidth Controller or Proportional bandwidth controller
> > =============================================================
> > What is our primary requirement here? A weight based proportional
> > bandwidth controller where we can use the resources optimally and any
> > kind of throttling kicks in only if there is contention for the disk.
> > 
> > Or do we want max bandwidth control, where a group is not allowed to use
> > the disk beyond its limit even if the disk is otherwise free? 
> > 
> > Or do we need both? I would think that at some point we will need
> > both, but we can start with proportional bandwidth control first.
> 
> How about making the throttling policy user selectable, like the IO
> scheduler, and putting it in the higher layer? Then we could support
> all of the policies (time-based, size-based and rate limiting). There
> does not seem to be a single solution which satisfies all users. But I
> agree with starting with proportional bandwidth control first. 
> 
> BTW, I will start to reimplement dm-ioband into block layer.
> 
> > Fairness for higher level logical devices
> > =========================================
> > Do we want good fairness numbers for higher level logical devices also,
> > or is it sufficient to provide fairness at leaf nodes? Providing fairness
> > at leaf nodes can help us use the resources optimally, and in the process
> > we get fairness at the higher level too in many of the cases.
> 
> We should also take care of block devices which provide their own
> make_request_fn() and do not use an IO scheduler. We can't use the leaf
> node approach with such devices.
> 
> > But do we want strict fairness numbers on higher level logical devices
> > even if it means sub-optimal usage of the underlying physical devices?
> > 
> > I think that for proportional bandwidth control it should be OK to provide
> > fairness at the leaf nodes, but for max bandwidth control it might make
> > more sense to provide control at the higher level. Consider a
> > case where, on a striped device, a customer wants to limit a group to
> > 30MB/s; with leaf node control, if every leaf node provides
> > 30MB/s, it can accumulate to much more than the specified rate at the
> > logical device.
> >
> > Latency Control and strong isolation between groups
> > ===================================================
> > Do we want better latencies and stronger isolation between groups?
> > 
> > I think if the problem is solved at the IO scheduler level, we can achieve
> > better latency control and hence stronger isolation between groups.
> > 
> > Higher level solutions will find it hard to provide the same kind of latency
> > control and isolation between groups as an IO scheduler based solution.
> 
> Why do you think that it is hard for a higher level solution to provide this? 
> I think that it is a matter of how the throttling policy is implemented.
> 
> > Fairness for buffered writes
> > ============================
> > Doing IO control anywhere below the page cache has the disadvantage that
> > the page cache might not dispatch more writes from a higher weight group,
> > hence the higher weight group might not see more IO done. Andrew says that
> > we don't have a solution to this problem in the kernel and he would like to
> > see it handled properly.
> > 
> > The only way to solve this seems to be to slow down the writers before they
> > write into the page cache. The IO throttling patch handled it by slowing down 
> > the writer if it crossed the max specified rate. Other suggestions have come
> > in the form of a dirty_ratio per memory cgroup, or a separate cgroup
> > controller altogether where some kind of per group write limit can be specified.
> > 
> > So whether the solution is implemented at the IO scheduler layer or at the
> > device mapper layer, both shall have to rely on another controller being
> > co-mounted to handle buffered writes properly.
> >
> > Fairness with-in group
> > ======================
> > One of the issues with a higher level controller is how to do fair
> > throttling so that fairness within the group is not impacted, especially
> > making sure that we don't break the notion of ioprio of the
> > processes within the group.
> 
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here are the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> 
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.
> 

Hi Ryo,

I am doing some more tests to see how we maintain the notion of prio
within a group.

I have created two ioband devices, ioband1 and ioband2, of weight 100 each on
two disk partitions. On one partition/device (ioband1) a buffered writer is
doing writeout, and on the other partition I launch one prio0 reader and an
increasing number of prio4 readers using fio, let it run for 30 
seconds, and see how BW gets distributed between prio0 and prio4 processes.

Note, here readers are doing direct IO.

I did this test with vanilla CFQ and with dm-ioband + cfq.

With vanilla CFQ
----------------
<---------prio4 readers --------------------------> <---prio0 reader--->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   12892KiB/s  12892KiB/s  12892KiB/s  409K usec   14705KiB/s  252K usec   
2   5667KiB/s   5637KiB/s   11302KiB/s  717K usec   17555KiB/s  339K usec   
4   4395KiB/s   4173KiB/s   17027KiB/s  933K usec   12437KiB/s  553K usec   
8   2652KiB/s   2391KiB/s   20268KiB/s  1410K usec  9482KiB/s   685K usec   
16  1653KiB/s   1413KiB/s   24035KiB/s  2418K usec  5860KiB/s   1027K usec  

Note, as we increase the number of prio4 readers, the prio0 process's aggregate
bandwidth goes down (nr=2 seems to be the only exception) but it still
maintains more BW than any prio4 process.

Also note that as we increase the number of prio4 readers, their aggregate
bandwidth goes up, which is expected. 

With dm-ioband
--------------
<---------prio4 readers --------------------------> <---prio0 reader--->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   11242KiB/s  11242KiB/s  11242KiB/s  415K usec   3884KiB/s   244K usec   
2   8110KiB/s   6236KiB/s   14345KiB/s  304K usec   320KiB/s    125K usec   
4   6898KiB/s   622KiB/s    11059KiB/s  206K usec   503KiB/s    201K usec   
8   345KiB/s    47KiB/s     850KiB/s    342K usec   8350KiB/s   164K usec   
16  28KiB/s     28KiB/s     451KiB/s    688 msec    5092KiB/s   306K usec   

Looking at the output with dm-ioband, it seems to be all over the place.
Look at the aggregate bandwidth of the prio0 reader and how wildly it is
swinging: it first goes down and then suddenly jumps up way high.

Similarly, look at the aggregate bandwidth of the prio4 readers: the moment we
hit 8 readers, it suddenly tanks.

Look at the prio4 and prio0 reader BW with 16 prio4 processes running: a
prio4 process gets 28KiB/s while the prio0 process gets 5MB/s.

Can you please look into it? It looks like we have serious issues w.r.t.
fairness and bandwidth distribution within a group.

Thanks
Vivek


> > Especially the IO throttling patch was very bad in terms of prio within a 
> > group, where throttling treated everyone equally and the difference between
> > process prios disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control will most likely change the ratio in which reads
> > and writes are dispatched to disk within a group. It used to be decided
> > by the IO scheduler so far, but with higher level groups doing throttling
> > and possibly buffering the bios and releasing them later, they will have to
> > come up with their own policy on what proportion of reads and writes
> > should be dispatched. In the case of IO scheduler based control, all the
> > queuing takes place at the IO scheduler and it retains control of
> > the ratio in which reads and writes are dispatched.
> 
> I don't think it is a concern. In the current implementation of dm-ioband,
> sync/async IO requests are handled separately, and the
> backlogged IOs are released in order of arrival if both
> sync and async requests are backlogged.
> 
> > Summary
> > =======
> > 
> > - An IO scheduler based IO controller can provide better latencies,
> >   stronger isolation between groups and time based fairness, and will not
> >   interfere with IO scheduler policies like class, ioprio and
> >   reader vs writer handling.
> > 
> >   But it cannot guarantee fairness at higher level logical devices.
> >   Especially in the case of max bw control, leaf node control does not
> >   sound like the most appropriate thing.
> > 
> > - IO throttling provides max bw control in terms of an absolute rate. It
> >   has the advantage that it can provide control at a higher level logical
> >   device and also control buffered writes without an additional controller
> >   co-mounted.
> > 
> >   But it does only max bw control and not proportional control, so one might
> >   not be using resources optimally. It loses the sense of task prio and class
> >   within a group, as any task in the group can be throttled. Because
> >   throttling does not kick in till you hit the max bw limit, it should find
> >   it hard to provide the same latencies as IO scheduler based control.
> > 
> > - dm-ioband also has the advantage that it can provide fairness at higher
> >   level logical devices.
> > 
> >   But fairness is provided only in terms of size of IO or number of IOs;
> >   no time based fairness. It is very throughput oriented and does not 
> >   throttle a high speed group if another group is running a slow random
> >   reader. This results in bad latencies for the random reader group and
> >   weaker isolation between groups.
> 
> A new policy can be added to dm-ioband. Actually, range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling. Moreover there is room for improvement for existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
> 
> >   Also, it does not provide fairness if a group is not continuously
> >   backlogged. So if one is running 1-2 dd/sequential readers in the group,
> >   one does not get fairness until the workload is increased to the point
> >   where the group becomes continuously backlogged. This also results in
> >   poor latencies and limited fairness.
> 
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.
>  
> > At this point of time it does not look like a single IO controller can
> > cover all the scenarios/requirements. This means a few things to me.
> > 
> > - Drop some of the requirements and go with one implementation which meets
> >   that reduced set of requirements.
> >
> > - Have more than one IO controller implementation in the kernel: one for
> >   lower level control for better latencies, stronger isolation and optimal
> >   resource usage, and another for fairness at higher level logical devices
> >   and max bandwidth control. 
> > 
> >   And let the user decide which one to use based on his/her needs. 
> > 
> > - Come up with a more intelligent way of doing IO control where a single
> >   controller covers all the cases.
> > 
> > At this point of time, I am more inclined towards option 2 of having more
> > than one implementation in the kernel. :-) (Until and unless we can
> > brainstorm and come up with ideas to make option 3 happen.)
> >
> > > It would be great if we discuss our plans on the mailing list, so we
> > > can get early feedback from everyone.
> >  
> > This is what comes to my mind so far. Please add to the list if I have missed
> > some points. Also correct me if I am wrong about the pros/cons of the
> > approaches.
> >
> > Thoughts/ideas/opinions are welcome...
> > 
> > Thanks
> > Vivek
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
@ 2009-09-30  3:11         ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-30  3:11 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew,
	yoshikawa.takuya, jmoyer, nauman, mingo, m-ikeda, riel, lizf,
	fchecconi, s-uchida, containers, linux-kernel, akpm,
	righi.andrea, torvalds

On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and all,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> 
> > > We are starting from a point where there is no cgroup based IO
> > > scheduling in the kernel. And it is probably not reasonable to satisfy
> > > all IO scheduling related requirements in one patch set. We can start
> > > with something simple, and build on top of that. So a very simple
> > > patch set that enables cgroup based proportional scheduling for CFQ
> > > seems like the way to go at this point.
> > 
> > Sure, we can start with CFQ only. But a bigger question we need to answer
> > is that is CFQ the right place to solve the issue? Jens, do you think 
> > that CFQ is the right place to solve the problem?
> > 
> > Andrew seems to favor a high level approach so that IO schedulers are less
> > complex and we can provide fairness at high level logical devices also. 
> 
> I'm not in favor of expansion of CFQ, because some enterprise storages
> are better performed with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage system. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specific about Namuman's scheduler design.
> 
> > I will again try to summarize my understanding so far about the pros/cons
> > of each approach and then we can take the discussion forward.
> 
> Good summary. Thanks for your work.
> 
> > Fairness in terms of size of IO or disk time used
> > =================================================
> > On a seeky media, fairness in terms of disk time can get us better results
> > instead fairness interms of size of IO or number of IO.
> > 
> > If we implement some kind of time based solution at higher layer, then 
> > that higher layer should know who used how much of time each group used. We
> > can probably do some kind of timestamping in bio to get a sense when did it
> > get into disk and when did it finish. But on a multi queue hardware there
> > can be multiple requests in the disk either from same queue or from differnet
> > queues and with pure timestamping based apparoch, so far I could not think
> > how at high level we will get an idea who used how much of time.
> 
> IIUC, could the overlap time be calculated from time-stamp on a multi
> queue hardware?
>  
> > So this is the first point of contention that how do we want to provide
> > fairness. In terms of disk time used or in terms of size of IO/number of
> > IO.
> >
> > Max bandwidth Controller or Proportional bandwidth controller
> > =============================================================
> > What is our primary requirement here? A weight based proportional
> > bandwidth controller where we can use the resources optimally and any
> > kind of throttling kicks in only if there is contention for the disk.
> > 
> > Or we want max bandwidth control where a group is not allowed to use the
> > disk even if disk is free. 
> > 
> > Or we need both? I would think that at some point of time we will need
> > both but we can start with proportional bandwidth control first.
> 
> How about making throttling policy be user selectable like the IO
> scheduler and putting it in the higher layer? So we could support
> all of policies (time-based, size-based and rate limiting). There
> seems not to only one solution which satisfies all users. But I agree
> with starting with proportional bandwidth control first. 
> 
> BTW, I will start to reimplement dm-ioband into block layer.
> 
> > Fairness for higher level logical devices
> > =========================================
> > Do we want good fairness numbers for higher level logical devices also
> > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > at leaf nodes can help us use the resources optimally and in the process
> > we can get fairness at higher level also in many of the cases.
> 
> We should also take care of block devices which provide their own
> make_request_fn() and not use a IO scheduler. We can't use the leaf
> nodes approach to such devices.
> 
> > But do we want strict fairness numbers on higher level logical devices
> > even if it means sub-optimal usage of unerlying phsical devices?
> > 
> > I think that for proportinal bandwidth control, it should be ok to provide
> > fairness at higher level logical device but for max bandwidth control it
> > might make more sense to provide fairness at higher level. Consider a
> > case where from a striped device a customer wants to limit a group to
> > 30MB/s and in case of leaf node control, if every leaf node provides
> > 30MB/s, it might accumulate to much more than specified rate at logical
> > device.
> >
> > Latency Control and strong isolation between groups
> > ===================================================
> > Do we want a good isolation between groups and better latencies and
> > stronger isolation between groups?
> > 
> > I think if problem is solved at IO scheduler level, we can achieve better
> > latency control and hence stronger isolation between groups.
> > 
> > Higher level solutions should find it hard to provide same kind of latency
> > control and isolation between groups as IO scheduler based solution.
> 
> Why do you think that the higher level solution is hard to provide it? 
> I think that it is a matter of how to implement throttling policy.
> 
> > Fairness for buffered writes
> > ============================
> > Doing io control at any place below page cache has disadvantage that page
> > cache might not dispatch more writes from higher weight group hence higher
> > weight group might not see more IO done. Andrew says that we don't have
> > a solution to this problem in kernel and he would like to see it handled
> > properly.
> > 
> > Only way to solve this seems to be to slow down the writers before they
> > write into page cache. IO throttling patch handled it by slowing down 
> > writer if it crossed max specified rate. Other suggestions have come in
> > the form of dirty_ratio per memory cgroup or a separate cgroup controller
> > al-together where some kind of per group write limit can be specified.
> > 
> > So if solution is implemented at IO scheduler layer or at device mapper
> > layer, both shall have to rely on another controller to be co-mounted
> > to handle buffered writes properly.
> >
> > Fairness with-in group
> > ======================
> > One of the issues with higher level controller is how to do fair
> > throttling so that fairness with-in group is not impacted. Especially
> > the case of making sure that we don't break the notion of ioprio of the
> > processes with-in group.
> 
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here are the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> 
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.
> 

Hi Ryo,

I am doing some more tests to see how we maintain the notion of prio
with-in a group.

I have created two ioband devices, ioband1 and ioband2, of weight 100 each on
two disk partitions. On one partition/device (ioband1) a buffered writer is
doing writeout and on the other partition I launch one prio0 reader and an
increasing number of prio4 readers using fio, let it run for 30
seconds and see how BW got distributed between prio0 and prio4 processes.

Note, here readers are doing direct IO.

I did this test with vanilla CFQ and with dm-ioband + cfq.

With vanilla CFQ
----------------
<---------prio4 readers --------------------------> <---prio0 reader--->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   12892KiB/s  12892KiB/s  12892KiB/s  409K usec   14705KiB/s  252K usec   
2   5667KiB/s   5637KiB/s   11302KiB/s  717K usec   17555KiB/s  339K usec   
4   4395KiB/s   4173KiB/s   17027KiB/s  933K usec   12437KiB/s  553K usec   
8   2652KiB/s   2391KiB/s   20268KiB/s  1410K usec  9482KiB/s   685K usec   
16  1653KiB/s   1413KiB/s   24035KiB/s  2418K usec  5860KiB/s   1027K usec  

Note, as we increase the number of prio4 readers, the prio0 process's
aggregate bandwidth goes down (nr=2 seems to be the only exception) but it
still maintains more BW than the prio4 processes.

Also note that as we increase the number of prio4 readers, their aggregate
bandwidth goes up, which is expected.

With dm-ioband
--------------
<---------prio4 readers --------------------------> <---prio0 reader--->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   11242KiB/s  11242KiB/s  11242KiB/s  415K usec   3884KiB/s   244K usec   
2   8110KiB/s   6236KiB/s   14345KiB/s  304K usec   320KiB/s    125K usec   
4   6898KiB/s   622KiB/s    11059KiB/s  206K usec   503KiB/s    201K usec   
8   345KiB/s    47KiB/s     850KiB/s    342K usec   8350KiB/s   164K usec   
16  28KiB/s     28KiB/s     451KiB/s    688 msec    5092KiB/s   306K usec   

Looking at the output with dm-ioband, it seems to be all over the place.
Look at aggregate bandwidth of prio0 reader and how wildly it is swinging.
It first goes down and then suddenly jumps up way high.

Similarly, look at the aggregate bandwidth of prio4 readers: the moment we
hit 8 readers, it suddenly tanks.

Look at prio4 reader and prio0 reader BW with 16 prio4 processes running.
Each prio4 process gets 28KiB/s and the prio0 process gets 5MB/s.

Can you please look into it? It looks like we have serious issues w.r.t.
fairness and bandwidth distribution with-in group.

Thanks
Vivek


> > Especially io throttling patch was very bad in terms of prio with-in 
> > group where throttling treated everyone equally and difference between
> > process prio disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control most likely will change the ratio in which reads
> > and writes are dispatched to disk with-in group. It used to be decided
> > by IO scheduler so far but with higher level groups doing throttling and
> > possibly buffering the bios and releasing them later, they will have to
> > come up with their own policy on in what proportion reads and writes
> > should be dispatched. In case of IO scheduler based control, all the
> > queuing takes place at IO scheduler and it still retains control of
> > in what ratio reads and writes should be dispatched.
> 
> I don't think it is a concern. The current implementation of dm-ioband
> is that sync/async IO requests are handled separately and the
> backlogged IOs are released according to the order of arrival if both
> sync and async requests are backlogged.
> 
> > Summary
> > =======
> > 
> > - An io scheduler based io controller can provide better latencies,
> >   stronger isolation between groups, time based fairness and will not
> >   interfere with io schedulers policies like class, ioprio and
> >   reader vs writer issues.
> > 
> >   But it cannot guarantee fairness at higher level logical devices.
> >   Especially in case of max bw control, leaf node control does not sound
> >   to be the most appropriate thing.
> > 
> > - IO throttling provides max bw control in terms of absolute rate. It has
> >   the advantage that it can provide control at higher level logical device
> >   and also control buffered writes without need of additional controller
> >   co-mounted.
> > 
> >   But it does only max bw control and not proportion control so one might
> >   not be using resources optimally. It loses the sense of task prio and class
> >   with-in group as any of the task can be throttled with-in group. Because
> >   throttling does not kick in till you hit the max bw limit, it should find
> >   it hard to provide same latencies as io scheduler based control.
> > 
> > - dm-ioband also has the advantage that it can provide fairness at higher
> >   level logical devices.
> > 
> >   But, fairness is provided only in terms of size of IO or number of IO.
> >   No time based fairness. It is very throughput oriented and does not 
> >   throttle high speed group if other group is running slow random reader.
> >   This results in bad latencies for random reader group and weaker
> >   isolation between groups.
> 
> A new policy can be added to dm-ioband. Actually, range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling. Moreover there is room for improvement for existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
> 
> >   Also it does not provide fairness if a group is not continuously
> >   backlogged. So if one is running 1-2 dd/sequential readers in the group,
> >   one does not get fairness until workload is increased to a point where
> >   group becomes continuously backlogged. This also results in poor
> >   latencies and limited fairness.
> 
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.
>  
> > At this point of time it does not look like a single IO controller can cover
> > all the scenarios/requirements. This means a few things to me.
> > 
> > - Drop some of the requirements and go with one implementation which meets
> >   those reduced set of requirements.
> >
> > - Have more than one IO controller implementation in kernel. One for lower
> >   level control for better latencies, stronger isolation and optimal resource
> >   usage and other one for fairness at higher level logical devices and max
> >   bandwidth control. 
> > 
> >   And let user decide which one to use based on his/her needs. 
> > 
> > - Come up with more intelligent way of doing IO control where single
> >   controller covers all the cases.
> > 
> > At this point of time, I am more inclined towards option 2 of having more
> > than one implementation in kernel. :-) (Until and unless we can brainstorm
> > and come up with ideas to make option 3 happen).
> >
> > > It would be great if we discuss our plans on the mailing list, so we
> > > can get early feedback from everyone.
> >  
> > This is what comes to my mind so far. Please add to the list if I have missed
> > some points. Also correct me if I am wrong about the pros/cons of the
> > approaches.
> >
> > Thoughts/ideas/opinions are welcome...
> > 
> > Thanks
> > Vivek
> 
> Thanks,
> Ryo Tsuruta


* Re: IO scheduler based IO controller V10
       [not found]         ` <20090929141049.GA12141-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-29 19:53           ` Nauman Rafique
  2009-09-30  8:43           ` Ryo Tsuruta
  1 sibling, 0 replies; 349+ messages in thread
From: Nauman Rafique @ 2009-09-29 19:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

We have been going around in circles for the past many months on this
issue of the IO controller. I thought that we were getting closer to a
point where we agree on one approach and go with it, but apparently we
are not. I think it would be useful at this point to learn from the
example of how similar functionality was introduced for other
resources like cpu scheduling and memory controllers.

We are starting from a point where there is no cgroup based resource
allocation for disks and there is a lot to be done. CFS has been doing
hierarchical proportional allocation for CPU scheduling for a while
now. Only recently someone has sent out patches for enforcing upper
limits. And it makes a lot of sense (more discussion on this later).
Also Fernando tells me that memory controller did not support
hierarchies in the first attempt. What I don't understand is, if we
are starting from scratch, why do we want to solve all the problems of
IO scheduling in one attempt?

Max bandwidth Controller or Proportional bandwidth controller
===============================================

Enforcing limits is applicable in the scenario where you are managing
a bunch of services in a data center and you want to either charge
them for what they use or you want a very predictable performance over
time. If we just do proportional allocation, then the actual
performance received by a user depends on other co-scheduled tasks. If
other tasks are not using the resource, you end up using their share.
But if all the other co-users become active, the 'extra' resource that
you had would be taken away. Thus without enforcing some upper limit,
predictability gets hurt.  But this becomes an issue only if we are
sharing resources. The most important precondition to sharing
resources is 'the requirement to provide isolation'. And isolation
includes controlling both bandwidth AND latency, in the presence of
other sharers. As Vivek has rightly pointed out, a ticket allocation
based algorithm is good for enforcing upper limits, but it is NOT good
for providing isolation i.e. latency control and even bandwidth in
some cases (as Vivek has shown with results in the last few emails).
Moreover, a solution that is implemented in higher layers (be it VFS
or DM) has little control over what happens in IO scheduler, again
hurting the isolation goal.

In the absence of isolation, we cannot even start sharing a resource.
The predictability or billing are secondary concerns that arise only
if we are sharing resources. If there is somebody who does not care
about isolation, but want to do their billing correctly, I would like
to know about it. Needless to say that max bandwidth limits can also
be enforced at IO scheduling layer.

Common layer vs CFQ
==================

Takuya has raised an interesting point here. If somebody wishes to use
noop, using a common layer IO controller on top of noop isn't
necessarily going to give them the same thing. In fact, with IO
controller, noop might behave much like CFQ.

Moreover at one point, if we decide that we absolutely need IO
controller to work for other schedulers too, we have this Vivek's
patch set as a proof-of-concept. For now, as Jens very rightly pointed
out in our discussion, we can have a "simple scheduler: Noop" and an
"intelligent scheduler: CFQ with cgroup based scheduling".

Class based scheduling
===================

CFQ has this notion of classes that needs to be supported in any
solution that we come up with, otherwise we break the semantics of the
existing scheduler. We have workloads which have strong latency
requirements. We have two options: either don't do resource sharing
for them OR share the resource but put them in a higher class (RT) so
that their latencies are not (or minimally) affected by other
workloads running with them.

A solution in higher layer can try to support those semantics, but
what if somebody wants to use a Noop scheduler and does not care about
those semantics? We will end up with multiple schedulers in the upper
layers, and who knows where all this will stop.

Controlling writeback
================

It seems like writeback path has problems, but we should not try to
solve those problems with the same patch set that is trying to do
basic cgroup based IO scheduling. Jens' patches for per-bdi pdflush are
already in. They should solve the problem of pdflush not sending down
enough IOs; at least Jens' results seem to show that. IMHO, the next
step is to use memory controller in conjunction with IO controller,
and a per group per bdi pdflush threads (only if a group is doing IO
on that bdi), something similar to io_group that we have in Vivek's
patches. That should solve multiple problems. First, it would allow us
to obviate the need of any tracking for dirty pages. Second, we can
build a feedback from IO scheduling layer to the upper layers. If the
number of pending writes in IO controller for a given group exceed a
limit, we block the submitting thread (pdflush), similar to current
congestion implementation. Then the group would start hitting dirty
limits at one point (we would need per group dirty limits, as has
already been pointed out by others), thus blocking the tasks that are
dirtying the pages. Thus using a block layer IO controller, we can
achieve an effect similar to that achieved by Righi's proposal.

Vivek has summarized most of the other arguments very well. In short,
what I am trying to say is lets start with something very simple that
satisfies some of the most important requirements and we can build
upon that.

On Tue, Sep 29, 2009 at 7:10 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
>> Hi Vivek and all,
>>
>> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
>>
>> > > We are starting from a point where there is no cgroup based IO
>> > > scheduling in the kernel. And it is probably not reasonable to satisfy
>> > > all IO scheduling related requirements in one patch set. We can start
>> > > with something simple, and build on top of that. So a very simple
>> > > patch set that enables cgroup based proportional scheduling for CFQ
>> > > seems like the way to go at this point.
>> >
>> > Sure, we can start with CFQ only. But a bigger question we need to answer
>> > is that is CFQ the right place to solve the issue? Jens, do you think
>> > that CFQ is the right place to solve the problem?
>> >
>> > Andrew seems to favor a high level approach so that IO schedulers are less
>> > complex and we can provide fairness at high level logical devices also.
>>
>> I'm not in favor of expansion of CFQ, because some enterprise storages
>> are better performed with NOOP rather than CFQ, and I think bandwidth
>> control is needed much more for such storage system. Is it easy to
>> support other IO schedulers even if a new IO scheduler is introduced?
>> I would like to know a bit more specifics about Nauman's scheduler design.
>>
>
> The new design is essentially the old design. Except the fact that
> suggestion is that in the first step instead of covering all the 4 IO
> schedulers, first cover only CFQ and then later others.
>
> So providing fairness for NOOP is not an issue. Even if we introduce new
> IO schedulers down the line, I can't think of a reason why can't we cover
> that too with common layer.
>
>> > I will again try to summarize my understanding so far about the pros/cons
>> > of each approach and then we can take the discussion forward.
>>
>> Good summary. Thanks for your work.
>>
>> > Fairness in terms of size of IO or disk time used
>> > =================================================
>> > On a seeky media, fairness in terms of disk time can get us better results
>> > instead of fairness in terms of size of IO or number of IO.
>> >
>> > If we implement some kind of time based solution at higher layer, then
>> > that higher layer should know who used how much of time each group used. We
>> > can probably do some kind of timestamping in bio to get a sense when did it
>> > get into disk and when did it finish. But on a multi queue hardware there
>> > can be multiple requests in the disk either from same queue or from different
>> > queues and with pure timestamping based approach, so far I could not think
>> > how at high level we will get an idea who used how much of time.
>>
>> IIUC, could the overlap time be calculated from time-stamp on a multi
>> queue hardware?
>
> So far could not think of anything clean. Do you have something in mind.
>
> I was thinking that elevator layer will do the merge of bios. So IO
> scheduler/elevator can time stamp the first bio in the request as it goes
> into the disk and again timestamp with finish time once request finishes.
>
> This way higher layer can get an idea how much disk time a group of bios
> used. But on multi queue, if we dispatch say 4 requests from same queue,
> then time accounting becomes an issue.
>
> Consider following where four requests rq1, rq2, rq3 and rq4 are
> dispatched to disk at time t0, t1, t2 and t3 respectively and these
> requests finish at time t4, t5, t6 and t7. For sake of simplicity assume
> time elapsed between each of milestones is t. Also assume that all these
> requests are from same queue/group.
>
>        t0   t1   t2   t3  t4   t5   t6   t7
>        rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
>
> Now higher layer will think that time consumed by group is:
>
> (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
>
> But the time elapsed is only 7t.
>
> Secondly, if a different group is running only a single sequential reader,
> CFQ will be driving a queue depth of 1 there, so its time will not be
> inflated in the same way, and this inaccuracy in accounting will lead to
> unfair share between groups.
>
> So we need something better to get a sense which group used how much of
> disk time.
>
>>
>> > So this is the first point of contention that how do we want to provide
>> > fairness. In terms of disk time used or in terms of size of IO/number of
>> > IO.
>> >
>> > Max bandwidth Controller or Proportional bandwidth controller
>> > =============================================================
>> > What is our primary requirement here? A weight based proportional
>> > bandwidth controller where we can use the resources optimally and any
>> > kind of throttling kicks in only if there is contention for the disk.
>> >
>> > Or we want max bandwidth control where a group is not allowed to use the
>> > disk even if disk is free.
>> >
>> > Or we need both? I would think that at some point of time we will need
>> > both but we can start with proportional bandwidth control first.
>>
>> How about making throttling policy be user selectable like the IO
>> scheduler and putting it in the higher layer? So we could support
>> all of the policies (time-based, size-based and rate limiting). There
>> seems not to be only one solution which satisfies all users. But I agree
>> with starting with proportional bandwidth control first.
>>
>
> What are the cases where time based policy does not work and size based
> policy works better, so that a user would choose the size based policy and
> not the time based one?
>
> I am not against implementing things in higher layer as long as we can
> ensure tight control on latencies, strong isolation between groups and
> not break CFQ's class and ioprio model with-in group.
>
>> BTW, I will start to reimplement dm-ioband into block layer.
>
> Can you elaborate little bit on this?
>
>>
>> > Fairness for higher level logical devices
>> > =========================================
>> > Do we want good fairness numbers for higher level logical devices also
>> > or it is sufficient to provide fairness at leaf nodes. Providing fairness
>> > at leaf nodes can help us use the resources optimally and in the process
>> > we can get fairness at higher level also in many of the cases.
>>
>> We should also take care of block devices which provide their own
>> make_request_fn() and do not use an IO scheduler. We can't use the leaf
>> node approach for such devices.
>>
>
> I am not sure how big an issue is this. This can be easily solved by
> making use of NOOP scheduler by these devices. What are the reasons for
> these devices to not use even noop?
>
>> > But do we want strict fairness numbers on higher level logical devices
>> > even if it means sub-optimal usage of underlying physical devices?
>> >
>> > I think that for proportional bandwidth control, it should be ok to provide
>> > fairness at leaf nodes, but for max bandwidth control it might make more
>> > sense to provide fairness at the higher level logical device. Consider a
>> > case where from a striped device a customer wants to limit a group to
>> > 30MB/s and in case of leaf node control, if every leaf node provides
>> > 30MB/s, it might accumulate to much more than specified rate at logical
>> > device.
>> >
>> > Latency Control and strong isolation between groups
>> > ===================================================
>> > Do we want a good isolation between groups and better latencies and
>> > stronger isolation between groups?
>> >
>> > I think if problem is solved at IO scheduler level, we can achieve better
>> > latency control and hence stronger isolation between groups.
>> >
>> > Higher level solutions should find it hard to provide same kind of latency
>> > control and isolation between groups as IO scheduler based solution.
>>
>> Why do you think that the higher level solution is hard to provide it?
>> I think that it is a matter of how to implement throttling policy.
>>
>
> So far both in dm-ioband and IO throttling solution I have seen that
> higher layer implements some kind of leaky bucket/token bucket algorithm,
> which inherently allows IO from all the competing groups until they run
> out of tokens and then these groups are made to wait till fresh tokens are
> issued.
>
> That means, most of the times, IO scheduler will see requests from more
> than one group at the same time and that will be the source of weak
> isolation between groups.
>
> Consider following simple examples. Assume there are two groups and one
> contains 16 random readers and other contains 1 random reader.
>
>                G1      G2
>               16RR     1RR
>
> Now it might happen that IO scheduler sees requests from all the 17 RR
> readers at the same time. (Throttling probably will kick in later because
> you would like to give one group a nice slice of 100ms otherwise
> sequential readers will suffer a lot and disk will become seek bound).
>
> So CFQ will dispatch requests (at least one), from each of the 16 random
> readers first and then from 1 random reader in group 2 and this increases
> the max latency for the application in group 2 and provides weak
> isolation.
>
> There will also be additional issues with CFQ preemption logic. CFQ will
> have no knowledge of groups and it will do cross group preemptions. For
> example if a meta data request comes in group1, it will preempt any of
> the queue being served in other groups. So somebody doing "find . *" or
> "cat <small files>" in one group will keep on preempting a sequential
> reader in other group. Again this will probably lead to higher max
> latencies.
>
> Note, even if CFQ does not enable idling on random readers, and expires
> queue after single dispatch, seeking time between queues can be
> significant. Similarly, if instead of 16 random readers we had 16 random
> synchronous writers we will have seek time issue as well as writers can
> often dump bigger requests which also adds to latency.
>
> This latency issue can be solved if we dispatch requests only from one
> group for a certain period of time and then move to the next group.
> (Something like what the common layer is doing).
>
> If we go for only a single group dispatching requests, then we shall have
> to implement some of the preemption semantics also in the higher layer
> because in certain cases we want to do preemption across the groups, like
> an RT task group preempting a non-RT task group etc.
>
> Once we go deeper into implementation, I think we will find more issues.
>
>> > Fairness for buffered writes
>> > ============================
>> > Doing io control at any place below page cache has disadvantage that page
>> > cache might not dispatch more writes from higher weight group hence higher
>> > weight group might not see more IO done. Andrew says that we don't have
>> > a solution to this problem in kernel and he would like to see it handled
>> > properly.
>> >
>> > Only way to solve this seems to be to slow down the writers before they
>> > write into page cache. IO throttling patch handled it by slowing down
>> > writer if it crossed max specified rate. Other suggestions have come in
>> > the form of dirty_ratio per memory cgroup or a separate cgroup controller
>> > al-together where some kind of per group write limit can be specified.
>> >
>> > So if solution is implemented at IO scheduler layer or at device mapper
>> > layer, both shall have to rely on another controller to be co-mounted
>> > to handle buffered writes properly.
>> >
>> > Fairness with-in group
>> > ======================
>> > One of the issues with higher level controller is how to do fair
>> > throttling so that fairness with-in group is not impacted. Especially
>> > the case of making sure that we don't break the notion of ioprio of the
>> > processes with-in group.
>>
>> I ran your test script to confirm that the notion of ioprio was not
>> broken by dm-ioband. Here are the results of the test.
>> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
>>
>> I think that the time period during which dm-ioband holds IO requests
>> for throttling would be too short to break the notion of ioprio.
>
> Ok, I re-ran that test. Previously default io_limit value was 192 and now
> I set it up to 256 as you suggested. I still see writer starving reader. I
> have removed "conv=fdatasync" from writer so that a writer is pure buffered
> writes.
>
> With vanilla CFQ
> ----------------
> reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s
>
> with dm-ioband default io_limit=192
> -----------------------------------
> writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s
> reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s
>
> ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
> ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100
>
> with dm-ioband default io_limit=256
> -----------------------------------
> reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s
>
> ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100
> ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100
>
> Notice that with vanilla CFQ, reader is taking 10 seconds to finish and
> with dm-ioband it takes more than 40 seconds to finish. So writer is still
> starving the reader with both io_limit 192 and 256.
>
> On top of that can you please give some details how increasing the
> buffered queue length reduces the impact of writers?
>
> IO Prio issue
> --------------
> I ran another test where two ioband devices were created of weight 100
> each on two partitions. In first group 4 readers were launched. Three
> readers are of class BE and prio 7, fourth one is of class BE prio 0. In
> group2, I launched a buffered writer.
>
> One would expect that the prio0 reader gets more bandwidth as compared to
> the prio 7 readers, and the prio 7 readers will get more or less the same bw.
> Looks like
> that is not happening. Look how vanilla CFQ provides much more bandwidth
> to prio0 reader as compared to prio7 reader and how putting them in the
> group reduces the difference between prio0 and prio7 readers.
>
> Following are the results.
>
> Vanilla CFQ
> ===========
> set1
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s
> 578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s
> 578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s
>
> set2
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s
> 578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s
> 578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s
>
> set3
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s
> 578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s
>
> with dm-ioband
> ==============
> ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
> ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100
>
> set1
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s
> 578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s
> 578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s
> 578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s
>
> set2
> ---
> prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s
> 578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s
> 578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s
> 578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s
>
> set3
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s
> 578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s
> 578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s
> 578867200 bytes (579 MB) copied, 139.986 s, 4.1 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s
>
> Note: In vanilla CFQ, prio0 reader got more than 350% BW of prio 7 reader.
>      With dm-ioband this ratio changed to less than 200%.
>
>      I will run more tests, but this shows how the notion of priority
>      with-in a group changes if we implement throttling at a higher layer
>      and don't keep it with CFQ.
>
>     The second thing which strikes me is that I divided the disk 50% each
>     between readers and writers, and in that case would expect protection
>     for writers and expect writers to finish fast. But writers have been
>     slowed down significantly, and it also kills overall disk throughput.
>     I think it probably became seek bound.
>
>     I think the moment I get more time, I will run some timed fio tests
>     and look at how overall disk performed and how bandwidth was
>     distributed with-in group and between groups.
>
>>
>> > Especially the io throttling patch was very bad in terms of prio with-in
>> > a group, where throttling treated everyone equally and the difference
>> > between process prios disappeared.
>> >
>> > Reads Vs Writes
>> > ===============
>> > A higher level control most likely will change the ratio in which reads
>> > and writes are dispatched to disk with-in group. It used to be decided
>> > by IO scheduler so far but with higher level groups doing throttling and
>> > possibly buffering the bios and releasing them later, they will have to
>> > come up with their own policy on in what proportion reads and writes
>> > should be dispatched. In case of IO scheduler based control, all the
>> > queuing takes place at IO scheduler and it still retains control of
>> > in what ratio reads and writes should be dispatched.
>>
>> I don't think it is a concern. The current implementation of dm-ioband
>> is that sync/async IO requests are handled separately and the
>> backlogged IOs are released according to the order of arrival if both
>> sync and async requests are backlogged.
>
> At least the version of dm-ioband I have is not producing the desired
> results. See above.
>
> Is there a newer version? I will run some tests on that too. But I think
> you will again run into the same issue where you will decide the ratio of
> reads vs writes with-in a group, and as I change the IO scheduler the
> results will vary.
>
> So at this point of time I can't think how you can solve the read vs write
> ratio issue at a higher layer without changing the behavior of the
> underlying IO scheduler.
>
>>
>> > Summary
>> > =======
>> >
>> > - An io scheduler based io controller can provide better latencies,
>> >   stronger isolation between groups, time based fairness and will not
>> >   interfere with io schedulers policies like class, ioprio and
>> >   reader vs writer issues.
>> >
>> >   But it cannot guarantee fairness at higher level logical devices.
>> >   Especially in case of max bw control, leaf node control does not sound
>> >   to be the most appropriate thing.
>> >
>> > - IO throttling provides max bw control in terms of absolute rate. It has
>> >   the advantage that it can provide control at higher level logical device
>> >   and also control buffered writes without need of additional controller
>> >   co-mounted.
>> >
>> >   But it does only max bw control and not proportional control, so one
>> >   might not be using resources optimally. It loses the sense of task prio
>> >   and class with-in a group, as any of the tasks can be throttled with-in
>> >   the group. Because throttling does not kick in till you hit the max bw
>> >   limit, it should find it hard to provide the same latencies as IO
>> >   scheduler based control.
>> >
>> > - dm-ioband also has the advantage that it can provide fairness at higher
>> >   level logical devices.
>> >
>> >   But, fairness is provided only in terms of size of IO or number of IO.
>> >   No time based fairness. It is very throughput oriented and does not
>> >   throttle high speed group if other group is running slow random reader.
>> >   This results in bad latencies for the random reader group and weaker
>> >   isolation between groups.
>>
>> A new policy can be added to dm-ioband. Actually, range-bw policy,
>> which provides min and max bandwidth control, does time-based
>> throttling. Moreover there is room for improvement for existing
>> policies. The write-starve-read issue you pointed out will be solved
>> soon.
>>
>> >   Also it does not provide fairness if a group is not continuously
>> >   backlogged. So if one is running 1-2 dd/sequential readers in the group,
>> >   one does not get fairness until workload is increased to a point where
>> >   group becomes continuously backlogged. This also results in poor
>> >   latencies and limited fairness.
>>
>> This is intended to efficiently use bandwidth of underlying devices
>> when IO load is low.
>
> But this has the following undesired results.
>
> - Slow moving group does not get reduced latencies. For example, random readers
>  in slow moving group get no isolation and will continue to see higher max
>  latencies.
>
> - A single sequential reader in one group does not get a fair share, and
>  we might be pushing buffered writes in the other group thinking that we
>  are getting better throughput. But the fact is that we are eating away
>  the reader's share in group1 and giving it to the writers in group2. Also
>  I showed that we did not necessarily improve the overall throughput of
>  the system by doing so (because it increases the number of seeks).
>
>  I had sent you a mail to show that.
>
> http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html
>
>  But you changed the test case to run 4 readers in a single group to show
>  that the throughput does not decrease. Please don't change test cases. In
>  case of 4 sequential readers in the group, the group is continuously
>  backlogged and you don't steal bandwidth from the slow moving group. So in
>  that mail I was not even discussing the scenario where you don't steal the
>  bandwidth from the other group.
>
>  I specially created one slow moving group with one reader so that we end
>  up stealing bandwidth from the slow moving group, and showed that we did
>  not achieve higher overall throughput by stealing the BW; at the same time
>  we did not get fairness for the single reader, and observed decreasing
>  throughput for the single reader as the number of writers in the other
>  group increased.
>
> Thanks
> Vivek
>
>>
>> > At this point of time it does not look like a single IO controller can
>> > cover all the scenarios/requirements. This means a few things to me.
>> >
>> > - Drop some of the requirements and go with one implementation which meets
>> >   those reduced set of requirements.
>> >
>> > - Have more than one IO controller implementation in the kernel. One for
>> >   lower level control for better latencies, stronger isolation and
>> >   optimal resource usage, and another one for fairness at higher level
>> >   logical devices and max bandwidth control.
>> >
>> >   And let user decide which one to use based on his/her needs.
>> >
>> > - Come up with more intelligent way of doing IO control where single
>> >   controller covers all the cases.
>> >
>> > At this point of time, I am more inclined towards option 2 of having
>> > more than one implementation in the kernel. :-) (Until and unless we can
>> > brainstorm and come up with ideas to make option 3 happen).
>> >
>> > > It would be great if we discuss our plans on the mailing list, so we
>> > > can get early feedback from everyone.
>> >
>> > This is what comes to my mind so far. Please add to the list if I have missed
>> > some points. Also correct me if I am wrong about the pros/cons of the
>> > approaches.
>> >
>> > Thoughts/ideas/opinions are welcome...
>> >
>> > Thanks
>> > Vivek
>>
>> Thanks,
>> Ryo Tsuruta
>


* Re: IO scheduler based IO controller V10
  2009-09-29 14:10         ` Vivek Goyal
  (?)
@ 2009-09-29 19:53         ` Nauman Rafique
  -1 siblings, 0 replies; 349+ messages in thread
From: Nauman Rafique @ 2009-09-29 19:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ryo Tsuruta, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo, riel, yoshikawa.takuya

We have been going around in circles for the past many months on this
issue of the IO controller. I thought that we were getting closer to a
point where we agree on one approach and go with it, but apparently we
are not. I think it would be useful at this point to learn from the
example of how similar functionality was introduced for other
resources like cpu scheduling and memory controllers.

We are starting from a point where there is no cgroup based resource
allocation for disks and there is a lot to be done. CFS has been doing
hierarchical proportional allocation for CPU scheduling for a while
now. Only recently someone has sent out patches for enforcing upper
limits. And it makes a lot of sense (more discussion on this later).
Also Fernando tells me that memory controller did not support
hierarchies in the first attempt. What I don't understand is, if we
are starting from scratch, why do we want to solve all the problems of
IO scheduling in one attempt?

Max bandwidth Controller or Proportional bandwidth controller
===============================================

Enforcing limits is applicable in the scenario where you are managing
a bunch of services in a data center and you want to either charge
them for what they use or you want a very predictable performance over
time. If we just do proportional allocation, then the actual
performance received by a user depends on other co-scheduled tasks. If
other tasks are not using the resource, you end up using their share.
But if all the other co-users become active, the 'extra' resource that
you had would be taken away. Thus without enforcing some upper limit,
predictability gets hurt.  But this becomes an issue only if we are
sharing resources. The most important precondition to sharing
resources is 'the requirement to provide isolation'. And isolation
includes controlling both bandwidth AND latency, in the presence of
other sharers. As Vivek has rightly pointed out, a ticket allocation
based algorithm is good for enforcing upper limits, but it is NOT good
for providing isolation i.e. latency control and even bandwidth in
some cases (as Vivek has shown with results in the last few emails).
Moreover, a solution that is implemented in higher layers (be it VFS
or DM) has little control over what happens in IO scheduler, again
hurting the isolation goal.

In the absence of isolation, we cannot even start sharing a resource.
The predictability or billing are secondary concerns that arise only
if we are sharing resources. If there is somebody who does not care
about isolation, but want to do their billing correctly, I would like
to know about it. Needless to say that max bandwidth limits can also
be enforced at IO scheduling layer.

Common layer vs CFS
==================

Takuya has raised an interesting point here. If somebody wishes to use
noop, using a common layer IO controller on top of noop isn't
necessarily going to give them the same thing. In fact, with an IO
controller, noop might behave much like CFQ.

Moreover at one point, if we decide that we absolutely need IO
controller to work for other schedulers too, we have this Vivek's
patch set as a proof-of-concept. For now, as Jens very rightly pointed
out in our discussion, we can have a "simple scheduler: Noop" and an
"intelligent scheduler: CFQ with cgroup based scheduling".

Class based scheduling
===================

CFQ has this notion of classes that needs to be supported in any
solution that we come up with, otherwise we break the semantics of the
existing scheduler. We have workloads which have strong latency
requirements. We have two options: either don't do resource sharing
for them OR share the resource but put them in a higher class (RT) so
that their latencies are not (or are minimally) affected by other
workloads running with them.

A solution in a higher layer can try to support those semantics, but
what if somebody wants to use the Noop scheduler and does not care about
those semantics? We will end up with multiple schedulers in the upper
layers, and who knows where all this will stop.

Controlling writeback
================

It seems like the writeback path has problems, but we should not try to
solve those problems with the same patch set that is trying to do
basic cgroup based IO scheduling. Jens' patches for per-bdi pdflush are
already in. They should solve the problem of pdflush not sending down
enough IOs; at least Jens' results seem to show that. IMHO, the next
step is to use the memory controller in conjunction with the IO controller,
and per group per bdi pdflush threads (only if a group is doing IO
on that bdi), something similar to io_group that we have in Vivek's
patches. That should solve multiple problems. First, it would allow us
to obviate the need of any tracking for dirty pages. Second, we can
build a feedback from IO scheduling layer to the upper layers. If the
number of pending writes in IO controller for a given group exceed a
limit, we block the submitting thread (pdflush), similar to current
congestion implementation. Then the group would start hitting dirty
limits at one point (we would need per group dirty limits, as has
already been pointed out by others), thus blocking the tasks that are
dirtying the pages. Thus using a block layer IO controller, we can
achieve an effect similar to that achieved by Righi's proposal.
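The backpressure loop described above can be illustrated with a toy model (hypothetical names; a real implementation would live in the block layer and use congestion waitqueues, not this API): each io_group caps its pending writes, a submitter that hits the cap must back off, and completions free slots again, so the pressure eventually propagates up to the task dirtying pages.

```python
class IoGroup:
    """Toy per-group pending-write accounting (illustrative only)."""

    def __init__(self, max_pending):
        self.max_pending = max_pending
        self.pending = 0

    def submit(self, nr=1):
        # Returns True if the write is accepted; False means the caller
        # (e.g. a per-group pdflush thread) must back off, the way a
        # congested bdi blocks writeback today.
        if self.pending + nr > self.max_pending:
            return False
        self.pending += nr
        return True

    def complete(self, nr=1):
        # IO completion frees capacity, letting the submitter make progress.
        self.pending -= nr

g = IoGroup(max_pending=2)
assert g.submit() and g.submit()
assert not g.submit()      # over the limit: submitter blocks / backs off
g.complete()
assert g.submit()          # a completion frees a slot again
```

With per-group dirty limits layered on top, a group whose writes pile up at the block layer would stall its flusher, then hit its dirty limit, and finally throttle the dirtier itself.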

Vivek has summarized most of the other arguments very well. In short,
what I am trying to say is: let's start with something very simple that
satisfies some of the most important requirements, and we can build
upon that.

On Tue, Sep 29, 2009 at 7:10 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
>> Hi Vivek and all,
>>
>> Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
>>
>> > > We are starting from a point where there is no cgroup based IO
>> > > scheduling in the kernel. And it is probably not reasonable to satisfy
>> > > all IO scheduling related requirements in one patch set. We can start
>> > > with something simple, and build on top of that. So a very simple
>> > > patch set that enables cgroup based proportional scheduling for CFQ
>> > > seems like the way to go at this point.
>> >
>> > Sure, we can start with CFQ only. But a bigger question we need to answer
>> > is that is CFQ the right place to solve the issue? Jens, do you think
>> > that CFQ is the right place to solve the problem?
>> >
>> > Andrew seems to favor a high level approach so that IO schedulers are less
>> > complex and we can provide fairness at high level logical devices also.
>>
>> I'm not in favor of expansion of CFQ, because some enterprise storages
>> are better performed with NOOP rather than CFQ, and I think bandwidth
>> control is needed much more for such storage system. Is it easy to
>> support other IO schedulers even if a new IO scheduler is introduced?
>> I would like to know a bit more specifics about Nauman's scheduler design.
>>
>
> The new design is essentially the old design, except that the suggestion
> is that in the first step, instead of covering all 4 IO schedulers, we
> first cover only CFQ and then later the others.
>
> So providing fairness for NOOP is not an issue. Even if we introduce new
> IO schedulers down the line, I can't think of a reason why we can't cover
> those too with the common layer.
>
>> > I will again try to summarize my understanding so far about the pros/cons
>> > of each approach and then we can take the discussion forward.
>>
>> Good summary. Thanks for your work.
>>
>> > Fairness in terms of size of IO or disk time used
>> > =================================================
>> > On a seeky media, fairness in terms of disk time can get us better results
>> > instead fairness interms of size of IO or number of IO.
>> >
>> > If we implement some kind of time based solution at a higher layer, then
>> > that higher layer should know how much time each group used. We can
>> > probably do some kind of timestamping in the bio to get a sense of when
>> > it got into the disk and when it finished. But on multi queue hardware
>> > there can be multiple requests in the disk, either from the same queue or
>> > from different queues, and with a pure timestamping based approach, so
>> > far I could not think how at a high level we will get an idea of who used
>> > how much time.
>>
>> IIUC, could the overlap time be calculated from the time-stamps on multi
>> queue hardware?
>
> So far I could not think of anything clean. Do you have something in mind?
>
> I was thinking that the elevator layer will do the merging of bios. So the
> IO scheduler/elevator can timestamp the first bio in the request as it
> goes into the disk and timestamp it again with the finish time once the
> request finishes.
>
> This way the higher layer can get an idea of how much disk time a group of
> bios used. But on multi queue, if we dispatch say 4 requests from the same
> queue, then time accounting becomes an issue.
>
> Consider following where four requests rq1, rq2, rq3 and rq4 are
> dispatched to disk at time t0, t1, t2 and t3 respectively and these
> requests finish at time t4, t5, t6 and t7. For sake of simplicity assume
> the time elapsed between each of these milestones is t. Also assume that
> all these requests are from the same queue/group.
>
>        t0   t1   t2   t3  t4   t5   t6   t7
>        rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4
>
> Now higher layer will think that time consumed by group is:
>
> (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
>
> But the time elapsed is only 7t.
>
> Secondly, if a different group is running only a single sequential reader,
> CFQ will be driving a queue depth of 1 there, so its accounted time will
> not be inflated, and this inaccuracy in accounting will lead to an unfair
> share between the groups.
>
> So we need something better to get a sense of which group used how much
> disk time.
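The overcounting in the rq1..rq4 example quoted above can be made concrete with a small sketch (illustrative only, not kernel code): summing per-request (finish - dispatch) times overstates disk time whenever requests overlap in the device, while merging the busy intervals recovers the true elapsed time. Note that interval merging still cannot attribute the overlapped time to individual groups, which is exactly the open problem being discussed.

```python
def naive_time(requests):
    # Sum of per-request (finish - dispatch) spans: overcounts any overlap.
    return sum(fin - disp for disp, fin in requests)

def busy_time(requests):
    # Union of [dispatch, finish) intervals: time the disk was actually busy.
    total, cur_start, cur_end = 0, None, None
    for disp, fin in sorted(requests):
        if cur_end is None or disp > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = disp, fin
        else:
            cur_end = max(cur_end, fin)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

# rq1..rq4 dispatched at t0..t3 and finishing at t4..t7, with t = 1 unit:
reqs = [(0, 4), (1, 5), (2, 6), (3, 7)]
print(naive_time(reqs))  # 16 -- what per-request timestamping reports
print(busy_time(reqs))   # 7  -- the time that actually elapsed
```

The gap between the two numbers grows with queue depth, which is why a depth-1 group (a lone sequential reader under CFQ) would look artificially cheap next to a deeply queued group.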
>
>>
>> > So this is the first point of contention that how do we want to provide
>> > fairness. In terms of disk time used or in terms of size of IO/number of
>> > IO.
>> >
>> > Max bandwidth Controller or Proportional bandwidth controller
>> > =============================================================
>> > What is our primary requirement here? A weight based proportional
>> > bandwidth controller where we can use the resources optimally and any
>> > kind of throttling kicks in only if there is contention for the disk.
>> >
>> > Or we want max bandwidth control where a group is not allowed to use the
>> > disk even if disk is free.
>> >
>> > Or we need both? I would think that at some point of time we will need
>> > both but we can start with proportional bandwidth control first.
>>
>> How about making the throttling policy user selectable like the IO
>> scheduler and putting it in the higher layer? So we could support
>> all of the policies (time-based, size-based and rate limiting). There
>> does not seem to be one solution which satisfies all users. But I agree
>> with starting with proportional bandwidth control first.
>>
>
> What are the cases where a time based policy does not work and a size
> based policy works better, such that a user would choose the size based
> policy and not the time based one?
>
> I am not against implementing things in higher layer as long as we can
> ensure tight control on latencies, strong isolation between groups and
> not break CFQ's class and ioprio model with-in group.
>
>> BTW, I will start to reimplement dm-ioband into block layer.
>
> Can you elaborate a little bit on this?
>
>>
>> > Fairness for higher level logical devices
>> > =========================================
>> > Do we want good fairness numbers for higher level logical devices also
>> > or it is sufficient to provide fairness at leaf nodes. Providing fairness
>> > at leaf nodes can help us use the resources optimally and in the process
>> > we can get fairness at higher level also in many of the cases.
>>
>> We should also take care of block devices which provide their own
>> make_request_fn() and do not use an IO scheduler. We can't use the leaf
>> nodes approach to such devices.
>>
>
> I am not sure how big an issue this is. It can be easily solved by
> having these devices use the NOOP scheduler. What are the reasons for
> these devices not to use even noop?
>
>> > But do we want strict fairness numbers on higher level logical devices
>> > even if it means sub-optimal usage of underlying physical devices?
>> >
>> > I think that for proportional bandwidth control, it should be ok to
>> > provide fairness at the leaf nodes, but for max bandwidth control it
>> > might make more sense to provide control at the higher level. Consider a
>> > case where from a striped device a customer wants to limit a group to
>> > 30MB/s and in case of leaf node control, if every leaf node provides
>> > 30MB/s, it might accumulate to much more than specified rate at logical
>> > device.
>> >
>> > Latency Control and strong isolation between groups
>> > ===================================================
>> > Do we want better latencies and stronger isolation between groups?
>> >
>> > I think if problem is solved at IO scheduler level, we can achieve better
>> > latency control and hence stronger isolation between groups.
>> >
>> > Higher level solutions should find it hard to provide same kind of latency
>> > control and isolation between groups as IO scheduler based solution.
>>
>> Why do you think that it is hard for a higher level solution to provide
>> it? I think that it is a matter of how to implement the throttling policy.
>>
>
> So far, both in the dm-ioband and IO throttling solutions, I have seen
> that the higher layer implements some kind of leaky bucket/token bucket
> algorithm, which inherently allows IO from all the competing groups until
> they run out of tokens, and then these groups are made to wait till fresh
> tokens are issued.
>
> That means, most of the time, the IO scheduler will see requests from more
> than one group at the same time, and that will be the source of weak
> isolation between groups.
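A minimal token-bucket sketch makes the point (purely illustrative; this is not the dm-ioband or io-throttle code): as long as both groups are under their configured rate, their requests pass straight through and the IO scheduler sees them interleaved, so isolation depends entirely on what the scheduler does next.

```python
class TokenBucket:
    """Toy token bucket: admit requests while tokens remain."""

    def __init__(self, rate, burst):
        self.rate = rate          # tokens added per unit time
        self.burst = burst        # bucket capacity
        self.tokens = burst
        self.last = 0.0

    def admit(self, now, cost=1):
        # Refill proportionally to elapsed time, then spend if possible.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True       # request goes to the IO scheduler immediately
        return False          # request is held back (throttled)

g1 = TokenBucket(rate=10, burst=5)
g2 = TokenBucket(rate=10, burst=5)
dispatched = []
for now in [0.0, 0.1, 0.2, 0.3]:
    # Both groups are under their limits, so their requests interleave:
    if g1.admit(now):
        dispatched.append("G1")
    if g2.admit(now):
        dispatched.append("G2")
print(dispatched)  # alternating G1/G2 -- no per-group exclusivity
```

Throttling only kicks in once a bucket runs dry; below that point the scheduler sees all 17 readers of the example below at once.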
>
> Consider the following simple example. Assume there are two groups; one
> contains 16 random readers and the other contains 1 random reader.
>
>                G1      G2
>               16RR     1RR
>
> Now it might happen that IO scheduler sees requests from all the 17 RR
> readers at the same time. (Throttling probably will kick in later because
> you would like to give one group a nice slice of 100ms otherwise
> sequential readers will suffer a lot and disk will become seek bound).
>
> So CFQ will dispatch requests (at least one) from each of the 16 random
> readers first and then from the 1 random reader in group 2, and this
> increases the max latency for the application in group 2 and provides weak
> isolation.
>
> There will also be additional issues with CFQ preemption logic. CFQ will
> have no knowledge of groups and it will do cross group preemptions. For
> example if a meta data request comes in group1, it will preempt any of
> the queues being served in other groups. So somebody doing "find . *" or
> "cat <small files>" in one group will keep on preempting a sequential
> reader in the other group. Again this will probably lead to higher max
> latencies.
>
> Note, even if CFQ does not enable idling on random readers, and expires
> the queue after a single dispatch, the seek time between queues can be
> significant. Similarly, if instead of 16 random readers we had 16 random
> synchronous writers, we would have the seek time issue as well, and
> writers can often dump bigger requests, which also adds to latency.
>
> This latency issue can be solved if we dispatch requests only from one
> group for a certain period of time and then move to the next group.
> (Something like what the common layer is doing.)
>
> If we go with only a single group dispatching requests, then we shall have
> to implement some of the preemption semantics in the higher layer as well,
> because in certain cases we want to do preemption across the groups, like
> an RT task group preempting a non-RT task group etc.
>
> Once we go deeper into implementation, I think we will find more issues.
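The single-group-at-a-time dispatching mentioned above can be sketched as a round-robin of exclusive time slices (illustrative pseudo-logic, not the actual elevator code): within a slice only one group's queue feeds the disk, so the lone random reader in G2 is served after at most one slice of G1, instead of queuing behind all 16 of G1's readers.

```python
from collections import deque

def dispatch(groups, slice_len, total_time):
    """groups: {name: deque of per-request service times}.
    Gives each group an exclusive time slice in round-robin order and
    returns the order in which groups' requests reached the disk."""
    order, timeline, now = deque(groups), [], 0
    while now < total_time and any(groups.values()):
        g = order[0]
        order.rotate(-1)
        slice_end = now + slice_len
        # Only this group's requests are sent to the disk during its slice.
        while groups[g] and now < slice_end:
            now += groups[g].popleft()
            timeline.append(g)
    return timeline

# G1: 16 random readers, G2: 1 random reader, each request costs 10 units.
groups = {"G1": deque([10] * 16), "G2": deque([10])}
tl = dispatch(groups, slice_len=100, total_time=400)
print(tl[:12])  # G2 is served right after G1's first slice, not after all 16
```

In this toy run G2's request completes at t=110 rather than t=170, which is the latency benefit the common-layer approach is after; the preemption semantics (RT vs BE groups) would have to be layered on top of this rotation.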
>
>> > Fairness for buffered writes
>> > ============================
>> > Doing io control at any place below the page cache has the disadvantage
>> > that the page cache might not dispatch more writes from the higher
>> > weight group, hence the higher weight group might not see more IO done.
>> > Andrew says that we don't have a solution to this problem in the kernel
>> > and he would like to see it handled properly.
>> >
>> > The only way to solve this seems to be to slow down the writers before
>> > they write into the page cache. The IO throttling patch handled it by
>> > slowing down the writer if it crossed the max specified rate. Other
>> > suggestions have come in the form of a dirty_ratio per memory cgroup, or
>> > a separate cgroup controller altogether where some kind of per group
>> > write limit can be specified.
>> >
>> > So if solution is implemented at IO scheduler layer or at device mapper
>> > layer, both shall have to rely on another controller to be co-mounted
>> > to handle buffered writes properly.
>> >
>> > Fairness with-in group
>> > ======================
>> > One of the issues with a higher level controller is how to do fair
>> > throttling so that fairness with-in a group is not impacted, especially
>> > making sure that we don't break the notion of ioprio of the
>> > processes with-in a group.
>>
>> I ran your test script to confirm that the notion of ioprio was not
>> broken by dm-ioband. Here is the results of the test.
>> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
>>
>> I think that the time period during which dm-ioband holds IO requests
>> for throttling would be too short to break the notion of ioprio.
>
> Ok, I re-ran that test. Previously the default io_limit value was 192 and
> now I set it to 256 as you suggested. I still see the writer starving the
> reader. I have removed "conv=fdatasync" from the writer so that the writer
> is doing pure buffered writes.
>
> With vanilla CFQ
> ----------------
> reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s
>
> with dm-ioband default io_limit=192
> -----------------------------------
> writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s
> reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s
>
> ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
> ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100
>
> with dm-ioband default io_limit=256
> -----------------------------------
> reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s
>
> ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100
> ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100
>
> Notice that with vanilla CFQ the reader takes 10 seconds to finish, while
> with dm-ioband it takes more than 40 seconds. So the writer is still
> starving the reader with both io_limit 192 and 256.
>
> On top of that, can you please give some details on how increasing the
> buffered queue length reduces the impact of writers?
>
> IO Prio issue
> --------------
> I ran another test where two ioband devices of weight 100 each were
> created on two partitions. In the first group 4 readers were launched.
> Three readers are of class BE and prio 7; the fourth one is of class BE
> prio 0. In group2, I launched a buffered writer.
>
> One would expect that the prio 0 reader gets more bandwidth as compared to
> the prio 7 readers, and that the prio 7 readers get more or less the same
> bw. Looks like that is not happening. Look how vanilla CFQ provides much
> more bandwidth to the prio 0 reader as compared to the prio 7 readers, and
> how putting them in a group reduces the difference between the prio 0 and
> prio 7 readers.
>
> Following are the results.
>
> Vanilla CFQ
> ===========
> set1
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s
> 578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s
> 578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s
>
> set2
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s
> 578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s
> 578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s
>
> set3
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s
> 578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s
> 578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s
>
> with dm-ioband
> ==============
> ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
> ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100
>
> set1
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s
> 578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s
> 578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s
> 578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s
>
> set2
> ---
> prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s
> 578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s
> 578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s
> 578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s
>
> set3
> ----
> prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s
> 578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s
> 578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s
> 578867200 bytes (579 MB) copied, 139.986 s, 4.1 MB/s
> writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s
>
> Note: In vanilla CFQ, prio0 reader got more than 350% BW of prio 7 reader.
>      With dm-ioband this ratio changed to less than 200%.
>
>      I will run more tests, but this shows how the notion of priority
>      with-in a group changes if we implement throttling at a higher layer
>      and don't keep it with CFQ.
>
>     The second thing which strikes me is that I divided the disk 50% each
>     between readers and writers, and in that case would expect protection
>     for writers and expect writers to finish fast. But writers have been
>     slowed down significantly, and it also kills overall disk throughput.
>     I think it probably became seek bound.
>
>     I think the moment I get more time, I will run some timed fio tests
>     and look at how overall disk performed and how bandwidth was
>     distributed with-in group and between groups.
>
>>
>> > Especially the io throttling patch was very bad in terms of prio with-in
>> > a group, where throttling treated everyone equally and the difference
>> > between process prios disappeared.
>> >
>> > Reads Vs Writes
>> > ===============
>> > A higher level control most likely will change the ratio in which reads
>> > and writes are dispatched to disk with-in group. It used to be decided
>> > by IO scheduler so far but with higher level groups doing throttling and
>> > possibly buffering the bios and releasing them later, they will have to
>> > come up with their own policy on in what proportion reads and writes
>> > should be dispatched. In case of IO scheduler based control, all the
>> > queuing takes place at the IO scheduler and it still retains control of
>> > in what ratio reads and writes should be dispatched.
>>
>> I don't think it is a concern. The current implementation of dm-ioband
>> is that sync/async IO requests are handled separately and the
>> backlogged IOs are released according to the order of arrival if both
>> sync and async requests are backlogged.
>
> At least the version of dm-ioband I have is not producing the desired
> results. See above.
>
> Is there a newer version? I will run some tests on that too. But I think
> you will again run into the same issue where you decide the ratio of
> reads vs writes with-in a group, and as I change the IO scheduler the
> results will vary.
>
> So at this point of time I can't think how you can solve the read vs write
> ratio issue at a higher layer without changing the behavior of the underlying
> IO scheduler.
>
>>
>> > Summary
>> > =======
>> >
>> > - An io scheduler based io controller can provide better latencies,
>> >   stronger isolation between groups, time based fairness and will not
>> >   interfere with io schedulers policies like class, ioprio and
>> >   reader vs writer issues.
>> >
>> >   But it cannot guarantee fairness at higher level logical devices.
>> >   Especially in case of max bw control, leaf node control does not sound
>> >   to be the most appropriate thing.
>> >
>> > - IO throttling provides max bw control in terms of absolute rate. It has
>> >   the advantage that it can provide control at higher level logical device
>> >   and also control buffered writes without need of additional controller
>> >   co-mounted.
>> >
>> >   But it does only max bw control and not proportion control so one might
>> >   not be using resources optimally. It loses the sense of task prio and class
>> >   with-in group as any of the task can be throttled with-in group. Because
>> >   throttling does not kick in till you hit the max bw limit, it should find
>> >   it hard to provide same latencies as io scheduler based control.
>> >
>> > - dm-ioband also has the advantage that it can provide fairness at higher
>> >   level logical devices.
>> >
>> >   But, fairness is provided only in terms of size of IO or number of IO.
>> >   No time based fairness. It is very throughput oriented and does not
>> >   throttle high speed group if other group is running slow random reader.
>> >   This results in bad latencies for the random reader group and weaker
>> >   isolation between groups.
>>
>> A new policy can be added to dm-ioband. Actually, range-bw policy,
>> which provides min and max bandwidth control, does time-based
>> throttling. Moreover there is room for improvement for existing
>> policies. The write-starve-read issue you pointed out will be solved
>> soon.
>>
>> >   Also it does not provide fairness if a group is not continuously
>> >   backlogged. So if one is running 1-2 dd/sequential readers in the group,
>> >   one does not get fairness until workload is increased to a point where
>> >   group becomes continuously backlogged. This also results in poor
>> >   latencies and limited fairness.
>>
>> This is intended to efficiently use bandwidth of underlying devices
>> when IO load is low.
>
> But this has following undesired results.
>
> - Slow moving group does not get reduced latencies. For example, random readers
>  in slow moving group get no isolation and will continue to see higher max
>  latencies.
>
> - A single sequential reader in one group does not get its fair share and
>  we might be pushing buffered writes in the other group thinking that we
>  are getting better throughput. But the fact is that we are eating away
>  the reader's share in group1 and giving it to the writers in group2. Also I
>  showed that we did not necessarily improve the overall throughput of
>  the system by doing so (because it increases the number of seeks).
>
>  I had sent you a mail to show that.
>
> http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html
>
>  But you changed the test case to run 4 readers in a single group to show that
>  its throughput does not decrease. Please don't change test cases. In case of 4
>  sequential readers in the group, group is continuously backlogged and you
>  don't steal bandwidth from slow moving group. So in that mail I was not
>  even discussing the scenario when you don't steal the bandwidth from
>  other group.
>
>  I specifically created one slow moving group with one reader so that we end up
>  stealing bandwidth from the slow moving group, and showed that we did not achieve
>  higher overall throughput by stealing the BW; at the same time we did not get
>  fairness for the single reader and observed decreasing throughput for the single
>  reader as the number of writers in the other group increased.
>
> Thanks
> Vivek
>
>>
>> > At this point of time it does not look like a single IO controller can
>> > cover all the scenarios/requirements. This means a few things to me.
>> >
>> > - Drop some of the requirements and go with one implementation which meets
>> >   those reduced set of requirements.
>> >
>> > - Have more than one IO controller implementation in the kernel. One for lower
>> >   level control for better latencies, stronger isolation and optimal resource
>> >   usage and other one for fairness at higher level logical devices and max
>> >   bandwidth control.
>> >
>> >   And let user decide which one to use based on his/her needs.
>> >
>> > - Come up with more intelligent way of doing IO control where single
>> >   controller covers all the cases.
>> >
>> > At this point of time, I am more inclined towards option 2 of having more
>> > than one implementation in the kernel. :-) (Until and unless we can brainstorm
>> > and come up with ideas to make option 3 happen).
>> >
>> > > It would be great if we discuss our plans on the mailing list, so we
>> > > can get early feedback from everyone.
>> >
>> > This is what comes to my mind so far. Please add to the list if I have missed
>> > some points. Also correct me if I am wrong about the pros/cons of the
>> > approaches.
>> >
>> > Thoughts/ideas/opinions are welcome...
>> >
>> > Thanks
>> > Vivek
>>
>> Thanks,
>> Ryo Tsuruta
>

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]       ` <20090929.185653.183056711.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2009-09-29 10:49         ` Takuya Yoshikawa
@ 2009-09-29 14:10         ` Vivek Goyal
  2009-09-30  3:11         ` Vivek Goyal
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-29 14:10 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and all,
> 
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> 
> > > We are starting from a point where there is no cgroup based IO
> > > scheduling in the kernel. And it is probably not reasonable to satisfy
> > > all IO scheduling related requirements in one patch set. We can start
> > > with something simple, and build on top of that. So a very simple
> > > patch set that enables cgroup based proportional scheduling for CFQ
> > > seems like the way to go at this point.
> > 
> > Sure, we can start with CFQ only. But a bigger question we need to answer
> > is that is CFQ the right place to solve the issue? Jens, do you think 
> > that CFQ is the right place to solve the problem?
> > 
> > Andrew seems to favor a high level approach so that IO schedulers are less
> > complex and we can provide fairness at high level logical devices also. 
> 
> I'm not in favor of expansion of CFQ, because some enterprise storages
> perform better with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage systems. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specifically about Nauman's scheduler design.
> 

The new design is essentially the old design, except that the suggestion
is to first cover only CFQ instead of all 4 IO schedulers, and extend it
to the others later.

So providing fairness for NOOP is not an issue. Even if we introduce new
IO schedulers down the line, I can't think of a reason why we can't cover
those too with the common layer.

> > I will again try to summarize my understanding so far about the pros/cons
> > of each approach and then we can take the discussion forward.
> 
> Good summary. Thanks for your work.
> 
> > Fairness in terms of size of IO or disk time used
> > =================================================
> > On seeky media, fairness in terms of disk time can get us better results
> > than fairness in terms of size of IO or number of IOs.
> > 
> > If we implement some kind of time based solution at a higher layer, then
> > that higher layer should know how much time each group used. We
> > can probably do some kind of timestamping in the bio to get a sense of when it
> > got into the disk and when it finished. But on multi queue hardware there
> > can be multiple requests in the disk, either from the same queue or from
> > different queues, and with a pure timestamping based approach, so far I could
> > not think how at a high level we will get an idea of who used how much time.
> 
> IIUC, could the overlap time be calculated from the time-stamps on multi
> queue hardware?

So far I could not think of anything clean. Do you have something in mind?

I was thinking that elevator layer will do the merge of bios. So IO
scheduler/elevator can time stamp the first bio in the request as it goes
into the disk and again timestamp with finish time once request finishes.

This way higher layer can get an idea how much disk time a group of bios
used. But on multi queue, if we dispatch say 4 requests from same queue,
then time accounting becomes an issue.

Consider the following, where four requests rq1, rq2, rq3 and rq4 are
dispatched to the disk at times t0, t1, t2 and t3 respectively and these
requests finish at times t4, t5, t6 and t7. For the sake of simplicity,
assume the time elapsed between each of the milestones is t. Also assume
that all these requests are from the same queue/group.

        t0   t1   t2   t3  t4   t5   t6   t7
        rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4

Now the higher layer will think that the time consumed by the group is:

(t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t

But the time elapsed is only 7t.

Secondly, if a different group is running only a single sequential reader,
CFQ will be driving a queue depth of 1 there, the accounted time will not
run faster than the wall clock, and this inaccuracy in accounting will lead
to an unfair share between the groups.

So we need something better to get a sense of which group used how much
disk time.

>  
> > So this is the first point of contention that how do we want to provide
> > fairness. In terms of disk time used or in terms of size of IO/number of
> > IO.
> >
> > Max bandwidth Controller or Proportional bandwidth controller
> > =============================================================
> > What is our primary requirement here? A weight based proportional
> > bandwidth controller where we can use the resources optimally and any
> > kind of throttling kicks in only if there is contention for the disk.
> > 
> > Or do we want max bandwidth control, where a group is not allowed to use the
> > disk beyond its limit even if the disk is free?
> > 
> > Or do we need both? I would think that at some point of time we will need
> > both but we can start with proportional bandwidth control first.
> 
> How about making the throttling policy user selectable like the IO
> scheduler and putting it in the higher layer? So we could support
> all of the policies (time-based, size-based and rate limiting). There
> seems to be no single solution which satisfies all users. But I agree
> with starting with proportional bandwidth control first. 
> 

What are the cases where a time based policy does not work and a size based
policy works better, such that a user would choose the size based policy
and not the time based one?

I am not against implementing things in higher layer as long as we can
ensure tight control on latencies, strong isolation between groups and
not break CFQ's class and ioprio model with-in group.

> BTW, I will start to reimplement dm-ioband into block layer.

Can you elaborate a little bit on this?

> 
> > Fairness for higher level logical devices
> > =========================================
> > Do we want good fairness numbers for higher level logical devices also,
> > or is it sufficient to provide fairness at leaf nodes? Providing fairness
> > at leaf nodes can help us use the resources optimally and in the process
> > we can get fairness at higher level also in many of the cases.
> 
> We should also take care of block devices which provide their own
> make_request_fn() and do not use an IO scheduler. We can't use the leaf
> nodes approach for such devices.
> 

I am not sure how big an issue this is. It can be easily solved by making
these devices use the NOOP scheduler. What are the reasons for these
devices not to use even noop?

> > But do we want strict fairness numbers on higher level logical devices
> > even if it means sub-optimal usage of the underlying physical devices?
> > 
> > I think that for proportional bandwidth control, it should be ok to provide
> > fairness at higher level logical device but for max bandwidth control it
> > might make more sense to provide fairness at higher level. Consider a
> > case where from a striped device a customer wants to limit a group to
> > 30MB/s and in case of leaf node control, if every leaf node provides
> > 30MB/s, it might accumulate to much more than specified rate at logical
> > device.
> >
> > Latency Control and strong isolation between groups
> > ===================================================
> > Do we want a good isolation between groups and better latencies and
> > stronger isolation between groups?
> > 
> > I think if problem is solved at IO scheduler level, we can achieve better
> > latency control and hence stronger isolation between groups.
> > 
> > Higher level solutions should find it hard to provide same kind of latency
> > control and isolation between groups as IO scheduler based solution.
> 
> Why do you think that it is hard for a higher level solution to provide this?
> I think that it is a matter of how to implement the throttling policy.
> 

So far, in both the dm-ioband and IO throttling solutions I have seen that
the higher layer implements some kind of leaky bucket/token bucket
algorithm, which inherently allows IO from all the competing groups until
they run out of tokens, and then these groups are made to wait till fresh
tokens are issued.

That means that most of the time the IO scheduler will see requests from
more than one group at the same time, and that will be the source of weak
isolation between groups.

Consider following simple examples. Assume there are two groups and one
contains 16 random readers and other contains 1 random reader.

		G1	G2
	       16RR	1RR 

Now it might happen that the IO scheduler sees requests from all the 17 RR
readers at the same time. (Throttling probably will kick in later, because
you would like to give one group a nice slice of 100ms; otherwise
sequential readers will suffer a lot and the disk will become seek bound.)

So CFQ will dispatch requests (at least one) from each of the 16 random
readers first and then from the 1 random reader in group 2, and this
increases the max latency for the application in group 2 and provides weak
isolation.

There will also be additional issues with CFQ's preemption logic. CFQ will
have no knowledge of groups and it will do cross group preemptions. For
example, if a meta data request comes in group1, it will preempt whichever
queue is being served in the other groups. So somebody doing "find . *" or
"cat <small files>" in one group will keep on preempting a sequential
reader in the other group. Again, this will probably lead to higher max
latencies.

Note, even if CFQ does not enable idling on random readers and expires a
queue after a single dispatch, the seek time between queues can be
significant. Similarly, if instead of 16 random readers we had 16 random
synchronous writers, we would have the seek time issue as well; writers can
often dump bigger requests, which also adds to latency.

This latency issue can be solved if we dispatch requests only from one
group for a certain period of time and then move to the next group.
(Something the common layer is doing.)

If we go for only a single group dispatching requests, then we shall have
to implement some of the preemption semantics in the higher layer as well,
because in certain cases we want to do preemption across the groups, like
an RT task group preempting a non-RT task group etc.

Once we go deeper into implementation, I think we will find more issues.

> > Fairness for buffered writes
> > ============================
> > Doing IO control at any place below the page cache has the disadvantage that the
> > page cache might not dispatch more writes from the higher weight group, hence the
> > higher weight group might not see more IO done. Andrew says that we don't have
> > a solution to this problem in kernel and he would like to see it handled
> > properly.
> > 
> > The only way to solve this seems to be to slow down the writers before they
> > write into the page cache. The IO throttling patch handled it by slowing down the
> > writer if it crossed the max specified rate. Other suggestions have come in
> > the form of a dirty_ratio per memory cgroup or a separate cgroup controller
> > altogether where some kind of per group write limit can be specified.
> > 
> > So if solution is implemented at IO scheduler layer or at device mapper
> > layer, both shall have to rely on another controller to be co-mounted
> > to handle buffered writes properly.
> >
> > Fairness with-in group
> > ======================
> > One of the issues with a higher level controller is how to do fair
> > throttling so that fairness with-in a group is not impacted, especially
> > the case of making sure that we don't break the notion of ioprio of the
> > processes with-in a group.
> 
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here is the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> 
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.

Ok, I re-ran that test. Previously the default io_limit value was 192 and
now I have set it to 256 as you suggested. I still see the writer starving
the reader. I have removed "conv=fdatasync" from the writer so that it does
pure buffered writes.

With vanilla CFQ
----------------
reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s

with dm-ioband default io_limit=192
-----------------------------------
writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s
reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

with dm-ioband io_limit=256
---------------------------
reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100
ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100

Notice that with vanilla CFQ the reader takes about 10 seconds to finish,
while with dm-ioband it takes more than 40 seconds. So the writer is still
starving the reader with both io_limit 192 and 256.

On top of that, can you please give some details on how increasing the
buffered queue length reduces the impact of writers?

IO Prio issue
--------------
I ran another test where two ioband devices of weight 100 each were created
on two partitions. In the first group 4 readers were launched: three
readers of class BE, prio 7, and a fourth of class BE, prio 0. In
group2, I launched a buffered writer.

One would expect that the prio 0 reader gets more bandwidth as compared to
the prio 7 readers, and that the prio 7 readers will get more or less the
same bw. Looks like that is not happening. Look how vanilla CFQ provides
much more bandwidth to the prio 0 reader as compared to the prio 7 readers,
and how putting them in the group reduces the difference between the prio 0
and prio 7 readers.

Following are the results.

Vanilla CFQ
===========
set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s
578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s
578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s

set2
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s
578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s
578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s
578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s

with dm-ioband
==============
ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s
578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s
578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s
578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s

set2
---
prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s
578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s
578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s
578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s
578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s
578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s
578867200 bytes (579 MB) copied, 139.986 s, 4.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s

Note: In vanilla CFQ, prio0 reader got more than 350% BW of prio 7 reader. 
      With dm-ioband this ratio changed to less than 200%.

      I will run more tests, but this shows how the notion of priority with-in a
      group changes if we implement throttling at a higher layer and don't
      keep it with CFQ.

     The second thing which strikes me is that I divided the disk 50% each
     between readers and writers and in that case would expect protection
     for writers and expect writers to finish fast. But writers have been
     slowed down a lot and it also kills overall disk throughput. I think
     it probably became seek bound.

     I think the moment I get more time, I will run some timed fio tests
     and look at how overall disk performed and how bandwidth was
     distributed with-in group and between groups. 

> 
> > Especially io throttling patch was very bad in terms of prio with-in 
> > group where throttling treated everyone equally and difference between
> > process prio disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control most likely will change the ratio in which reads
> > and writes are dispatched to disk with-in group. It used to be decided
> > by IO scheduler so far but with higher level groups doing throttling and
> > possibly buffering the bios and releasing them later, they will have to
> > come up with their own policy on in what proportion reads and writes
> > should be dispatched. In case of IO scheduler based control, all the
> > queuing takes place at the IO scheduler and it still retains control of
> > in what ratio reads and writes should be dispatched.
> 
> I don't think it is a concern. The current implementation of dm-ioband
> is that sync/async IO requests are handled separately and the
> backlogged IOs are released according to the order of arrival if both
> sync and async requests are backlogged.

At least the version of dm-ioband I have is not producing the desired
results. See above.

Is there a newer version? I will run some tests on that too. But I think
you will again run into the same issue where you decide the ratio of
reads vs writes with-in a group, and as I change the IO scheduler the
results will vary.

So at this point of time I can't think how you can solve the read vs write
ratio issue at a higher layer without changing the behavior of the
underlying IO scheduler.

> 
> > Summary
> > =======
> > 
> > - An io scheduler based io controller can provide better latencies,
> >   stronger isolation between groups, time based fairness and will not
> >   interfere with io schedulers policies like class, ioprio and
> >   reader vs writer issues.
> > 
> >   But it cannot guarantee fairness at higher level logical devices.
> >   Especially in case of max bw control, leaf node control does not sound
> >   to be the most appropriate thing.
> > 
> > - IO throttling provides max bw control in terms of absolute rate. It has
> >   the advantage that it can provide control at higher level logical device
> >   and also control buffered writes without need of additional controller
> >   co-mounted.
> > 
> >   But it does only max bw control and not proportion control so one might
> >   not be using resources optimally. It loses the sense of task prio and class
> >   with-in group as any of the task can be throttled with-in group. Because
> >   throttling does not kick in till you hit the max bw limit, it should find
> >   it hard to provide same latencies as io scheduler based control.
> > 
> > - dm-ioband also has the advantage that it can provide fairness at higher
> >   level logical devices.
> > 
> >   But, fairness is provided only in terms of size of IO or number of IO.
> >   No time based fairness. It is very throughput oriented and does not 
> >   throttle high speed group if other group is running slow random reader.
> >   This results in bad latencies for the random reader group and weaker
> >   isolation between groups.
> 
> A new policy can be added to dm-ioband. Actually, range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling. Moreover there is room for improvement for existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
> 
> >   Also it does not provide fairness if a group is not continuously
> >   backlogged. So if one is running 1-2 dd/sequential readers in the group,
> >   one does not get fairness until workload is increased to a point where
> >   group becomes continuously backlogged. This also results in poor
> >   latencies and limited fairness.
> 
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.

But this has following undesired results.

- Slow moving group does not get reduced latencies. For example, random readers
  in slow moving group get no isolation and will continue to see higher max
  latencies.

- A single sequential reader in one group does not get its fair share and
  we might be pushing buffered writes in the other group thinking that we
  are getting better throughput. But the fact is that we are eating away
  the reader's share in group1 and giving it to the writers in group2. Also I
  showed that we did not necessarily improve the overall throughput of
  the system by doing so (because it increases the number of seeks).

  I had sent you a mail to show that.

http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html

 But you changed the test case to run 4 readers in a single group to show that
 its throughput does not decrease. Please don't change test cases. In case of 4
 sequential readers in the group, group is continuously backlogged and you
 don't steal bandwidth from slow moving group. So in that mail I was not
 even discussing the scenario when you don't steal the bandwidth from
 other group. 

 I specifically created one slow moving group with one reader so that we end up
 stealing bandwidth from the slow moving group, and showed that we did not achieve
 higher overall throughput by stealing the BW; at the same time we did not get
 fairness for the single reader and observed decreasing throughput for the single
 reader as the number of writers in the other group increased.

Thanks
Vivek

>  
> > At this point of time it does not look like a single IO controller can
> > cover all the scenarios/requirements. This means a few things to me.
> > 
> > - Drop some of the requirements and go with one implementation which meets
> >   those reduced set of requirements.
> >
> > - Have more than one IO controller implementation in the kernel. One for lower
> >   level control for better latencies, stronger isolation and optimal resource
> >   usage and other one for fairness at higher level logical devices and max
> >   bandwidth control. 
> > 
> >   And let user decide which one to use based on his/her needs. 
> > 
> > - Come up with more intelligent way of doing IO control where single
> >   controller covers all the cases.
> > 
> > At this point of time, I am more inclined towards option 2 of having more
> > than one implementation in the kernel. :-) (Until and unless we can brainstorm
> > and come up with ideas to make option 3 happen).
> >
> > > It would be great if we discuss our plans on the mailing list, so we
> > > can get early feedback from everyone.
> >  
> > This is what comes to my mind so far. Please add to the list if I have missed
> > some points. Also correct me if I am wrong about the pros/cons of the
> > approaches.
> >
> > Thoughts/ideas/opinions are welcome...
> > 
> > Thanks
> > Vivek
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-29  9:56     ` Ryo Tsuruta
@ 2009-09-29 14:10         ` Vivek Goyal
  2009-09-29 14:10         ` Vivek Goyal
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-29 14:10 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah,
	lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and all,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> 
> > > We are starting from a point where there is no cgroup based IO
> > > scheduling in the kernel. And it is probably not reasonable to satisfy
> > > all IO scheduling related requirements in one patch set. We can start
> > > with something simple, and build on top of that. So a very simple
> > > patch set that enables cgroup based proportional scheduling for CFQ
> > > seems like the way to go at this point.
> > 
> > Sure, we can start with CFQ only. But a bigger question we need to answer
> > is whether CFQ is the right place to solve the issue. Jens, do you think
> > that CFQ is the right place to solve the problem?
> > 
> > Andrew seems to favor a high level approach so that IO schedulers are less
> > complex and we can provide fairness at high level logical devices also. 
> 
> I'm not in favor of expanding CFQ, because some enterprise storage
> performs better with NOOP than with CFQ, and I think bandwidth
> control is needed much more for such storage systems. Will it be easy to
> support other IO schedulers if a new IO scheduler is introduced?
> I would like to know a bit more specifics about Nauman's scheduler design.
> 

The new design is essentially the old design, except that the suggestion is
that in the first step, instead of covering all 4 IO schedulers, we cover
only CFQ and later the others.

So providing fairness for NOOP is not an issue. Even if we introduce new
IO schedulers down the line, I can't think of a reason why we can't cover
those too with the common layer.

> > I will again try to summarize my understanding so far about the pros/cons
> > of each approach and then we can take the discussion forward.
> 
> Good summary. Thanks for your work.
> 
> > Fairness in terms of size of IO or disk time used
> > =================================================
> > On seeky media, fairness in terms of disk time can get us better results
> > than fairness in terms of size of IO or number of IOs.
> > 
> > If we implement some kind of time based solution at a higher layer, then
> > that higher layer needs to know how much time each group used. We can
> > probably do some kind of timestamping in the bio to get a sense of when it
> > got into the disk and when it finished. But on multi queue hardware there
> > can be multiple requests in the disk at once, either from the same queue or
> > from different queues, and with a pure timestamping based approach, so far
> > I could not think of how, at a high level, we will get an idea of who used
> > how much time.
> 
> IIUC, could the overlap time be calculated from time-stamps on multi
> queue hardware?

So far I could not think of anything clean. Do you have something in mind?

I was thinking that the elevator layer will do the merge of bios. So the IO
scheduler/elevator can timestamp the first bio in the request as it goes
into the disk and timestamp it again with the finish time once the request
finishes.

This way the higher layer can get an idea of how much disk time a group of
bios used. But on multi queue hardware, if we dispatch say 4 requests from
the same queue, then time accounting becomes an issue.

Consider the following, where four requests rq1, rq2, rq3 and rq4 are
dispatched to the disk at times t0, t1, t2 and t3 respectively, and these
requests finish at times t4, t5, t6 and t7. For the sake of simplicity,
assume the time elapsed between consecutive milestones is t. Also assume
that all these requests are from the same queue/group.

        t0   t1   t2   t3   t4   t5   t6   t7
        rq1  rq2  rq3  rq4  rq1  rq2  rq3  rq4

Now higher layer will think that time consumed by group is:

(t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t

But the time elapsed is only 7t.
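The arithmetic above can be sketched in a few lines (a toy illustration, not kernel code): summing per-request residency times double-counts overlapping requests, while the disk was actually busy only for the union of the intervals.

```python
# Four requests dispatched at t0..t3 and finishing at t4..t7,
# one time unit apart, all from the same queue/group (as above).
dispatch = [0, 1, 2, 3]
finish = [4, 5, 6, 7]

# Naive accounting: sum each request's individual residency time.
naive = sum(f - d for d, f in zip(dispatch, finish))

# What actually elapsed: the union of the overlapping intervals.
elapsed = max(finish) - min(dispatch)

print(naive, elapsed)  # 16 7 -- the group is charged 16t for 7t of disk time
```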

Secondly, if a different group is running only a single sequential reader,
CFQ will be driving a queue depth of 1 there, so its time accounting is not
inflated in the same way, and this inaccuracy in accounting will lead to an
unfair share between groups.

So we need something better to get a sense of which group used how much
disk time.

>  
> > So this is the first point of contention: how do we want to provide
> > fairness? In terms of disk time used, or in terms of size of IO/number
> > of IOs?
> >
> > Max bandwidth Controller or Proportional bandwidth controller
> > =============================================================
> > What is our primary requirement here? A weight based proportional
> > bandwidth controller where we can use the resources optimally and any
> > kind of throttling kicks in only if there is contention for the disk?
> > 
> > Or do we want max bandwidth control, where a group is not allowed to use
> > the disk even if the disk is free?
> > 
> > Or do we need both? I would think that at some point we will need both,
> > but we can start with proportional bandwidth control first.
> 
> How about making the throttling policy user selectable, like the IO
> scheduler, and putting it in the higher layer? Then we could support
> all of the policies (time-based, size-based and rate limiting). There
> seems to be no single solution which satisfies all users. But I agree
> with starting with proportional bandwidth control first.
> 

What are the cases where a time based policy does not work and a size based
policy works better, such that a user would choose the size based policy
and not the time based one?

I am not against implementing things in a higher layer as long as we can
ensure tight control on latencies, strong isolation between groups, and
not break CFQ's class and ioprio model within a group.

> BTW, I will start to reimplement dm-ioband into block layer.

Can you elaborate a little bit on this?

> 
> > Fairness for higher level logical devices
> > =========================================
> > Do we want good fairness numbers for higher level logical devices also,
> > or is it sufficient to provide fairness at leaf nodes? Providing fairness
> > at leaf nodes can help us use the resources optimally, and in the process
> > we can get fairness at the higher level also in many of the cases.
> 
> We should also take care of block devices which provide their own
> make_request_fn() and do not use an IO scheduler. We can't use the leaf
> nodes approach with such devices.
> 

I am not sure how big an issue this is. It can be easily solved by making
these devices use the NOOP scheduler. What are the reasons for these
devices to not use even noop?

> > But do we want strict fairness numbers on higher level logical devices
> > even if it means sub-optimal usage of underlying physical devices?
> > 
> > I think that for proportional bandwidth control, it should be OK to
> > provide control at leaf nodes, but for max bandwidth control it might
> > make more sense to provide control at the higher level. Consider a case
> > where, on a striped device, a customer wants to limit a group to 30MB/s;
> > with leaf node control, if every leaf node provides 30MB/s, it might
> > accumulate to much more than the specified rate at the logical device.
> >
> > Latency Control and strong isolation between groups
> > ===================================================
> > Do we want better latencies and stronger isolation between groups?
> > 
> > I think if the problem is solved at the IO scheduler level, we can
> > achieve better latency control and hence stronger isolation between
> > groups.
> > 
> > Higher level solutions will find it hard to provide the same kind of
> > latency control and isolation between groups as an IO scheduler based
> > solution.
> 
> Why do you think that it is hard for a higher level solution to provide
> this? I think that it is a matter of how the throttling policy is
> implemented.
> 

So far, in both the dm-ioband and IO throttling solutions, I have seen that
the higher layer implements some kind of leaky bucket/token bucket
algorithm, which inherently allows IO from all the competing groups until
they run out of tokens, and then these groups are made to wait till fresh
tokens are issued.

That means that most of the time, the IO scheduler will see requests from
more than one group at the same time, and that will be the source of weak
isolation between groups.
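A minimal token-bucket model makes this concrete (the class, names and numbers below are illustrative assumptions, not taken from dm-ioband's code): every group is admitted to the downstream IO scheduler until its own bucket drains, so the scheduler sees a mixed stream from all competing groups.

```python
class Group:
    """A group with a token bucket; one token admits one bio."""
    def __init__(self, name, tokens):
        self.name = name
        self.tokens = tokens

    def try_dispatch(self):
        # Admit the bio to the IO scheduler while tokens remain;
        # otherwise the group waits for the next token refill.
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

g1 = Group("G1", tokens=4)
g2 = Group("G2", tokens=4)

# Both groups submit IO concurrently; until the buckets drain,
# the IO scheduler sees requests from both groups at once.
stream = []
for _ in range(6):
    for g in (g1, g2):
        if g.try_dispatch():
            stream.append(g.name)

print(stream)  # ['G1', 'G2', 'G1', 'G2', 'G1', 'G2', 'G1', 'G2']
```

Throttling only separates the groups once a bucket is empty; before that point their requests interleave freely at the scheduler.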

Consider the following simple example. Assume there are two groups; one
contains 16 random readers and the other contains 1 random reader.

		G1	G2
	       16RR	1RR 

Now it might happen that the IO scheduler sees requests from all 17 random
readers at the same time. (Throttling will probably kick in later, because
you would like to give one group a nice slice of 100ms, otherwise
sequential readers will suffer a lot and the disk will become seek bound.)

So CFQ will dispatch requests (at least one) from each of the 16 random
readers first, and then from the 1 random reader in group 2, and this
increases the max latency for the application in group 2 and provides weak
isolation.

There will also be additional issues with CFQ's preemption logic. CFQ will
have no knowledge of groups and will do cross-group preemptions. For
example, if a metadata request comes in group 1, it will preempt whichever
queue is being served in other groups. So somebody doing "find . *" or
"cat <small files>" in one group will keep on preempting a sequential
reader in the other group. Again, this will probably lead to higher max
latencies.

Note that even if CFQ does not enable idling on random readers and expires
the queue after a single dispatch, the seek time between queues can be
significant. Similarly, if instead of 16 random readers we had 16 random
synchronous writers, we would have the seek time issue as well, and writers
can often dump bigger requests, which also adds to latency.

This latency issue can be solved if we dispatch requests from only one
group for a certain period of time and then move to the next group
(something like what the common layer is doing).

If we go with only a single group dispatching requests, then we shall have
to implement some of the preemption semantics in the higher layer as well,
because in certain cases we want to preempt across groups, like an RT task
group preempting a non-RT task group, etc.
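The one-group-at-a-time slicing described above can be sketched as a round robin of time slices; this is a toy model of the idea (function and numbers are illustrative, not the actual common-layer code):

```python
from itertools import cycle

def serve(groups, slice_ms=100, total_ms=400):
    """Dispatch from exactly one group per time slice, round robin,
    so no other group's requests interleave within a slice."""
    timeline = []
    elapsed = 0
    for g in cycle(groups):
        if elapsed >= total_ms:
            break
        timeline.append((g, slice_ms))
        elapsed += slice_ms
    return timeline

print(serve(["G1", "G2"]))
# [('G1', 100), ('G2', 100), ('G1', 100), ('G2', 100)]
```

Within each slice only the owning group is served, which is what bounds the other group's worst-case latency; preemption (e.g. RT over non-RT) would have to be layered on top of this loop.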

Once we go deeper into implementation, I think we will find more issues.

> > Fairness for buffered writes
> > ============================
> > Doing IO control at any place below the page cache has the disadvantage
> > that the page cache might not dispatch more writes from a higher weight
> > group, hence the higher weight group might not see more IO done. Andrew
> > says that we don't have a solution to this problem in the kernel and he
> > would like to see it handled properly.
> > 
> > The only way to solve this seems to be to slow down the writers before
> > they write into the page cache. The IO throttling patch handled it by
> > slowing down a writer if it crossed the max specified rate. Other
> > suggestions have come in the form of a per-memory-cgroup dirty_ratio or a
> > separate cgroup controller altogether where some kind of per group write
> > limit can be specified.
> > 
> > So whether the solution is implemented at the IO scheduler layer or at
> > the device mapper layer, both shall have to rely on another co-mounted
> > controller to handle buffered writes properly.
> >
> > Fairness with-in group
> > ======================
> > One of the issues with a higher level controller is how to do fair
> > throttling so that fairness within the group is not impacted, especially
> > making sure that we don't break the notion of ioprio of the processes
> > within the group.
> 
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here are the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> 
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.

Ok, I re-ran that test. Previously the default io_limit value was 192 and
now I set it to 256 as you suggested. I still see the writer starving the
reader. I have removed "conv=fdatasync" from the writer so that it does
pure buffered writes.

With vanilla CFQ
----------------
reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s

with dm-ioband default io_limit=192
-----------------------------------
writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s
reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

with dm-ioband default io_limit=256
-----------------------------------
reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100
ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100

Notice that with vanilla CFQ the reader takes 10 seconds to finish, while
with dm-ioband it takes more than 40 seconds. So the writer is still
starving the reader with both io_limit 192 and 256.

On top of that, can you please give some details on how increasing the
buffered queue length reduces the impact of writers?

IO Prio issue
--------------
I ran another test where two ioband devices of weight 100 each were created
on two partitions. In the first group, 4 readers were launched: three
readers of class BE and prio 7, and a fourth of class BE and prio 0. In
group 2, I launched a buffered writer.

One would expect the prio 0 reader to get more bandwidth compared to the
prio 7 readers, and the prio 7 readers to get more or less the same
bandwidth. It looks like that is not happening. Look at how vanilla CFQ
provides much more bandwidth to the prio 0 reader compared to the prio 7
readers, and how putting them in a group reduces the difference between
the prio 0 and prio 7 readers.
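For reference, a test of this shape can be driven with ionice and dd. The sketch below is an assumption about the setup (the scratch file, sizes, and the ionice fallback are mine, not from the original mail), reading a temporary file instead of the ioband devices used in the actual test:

```shell
# Sketch: one BE/prio-0 reader vs three BE/prio-7 readers (class 2 = BE).
FILE=$(mktemp)
dd if=/dev/zero of="$FILE" bs=1M count=8 2>/dev/null

run_reader() {
    # Run a reader at best-effort priority $1 (0 = highest, 7 = lowest);
    # fall back to a plain dd where ionice is unavailable.
    if command -v ionice >/dev/null 2>&1; then
        ionice -c2 -n"$1" dd if="$FILE" of=/dev/null bs=1M 2>/dev/null
    else
        dd if="$FILE" of=/dev/null bs=1M 2>/dev/null
    fi
}

run_reader 0 &                             # the prio 0 reader
for i in 1 2 3; do run_reader 7 & done     # three prio 7 readers
wait
rm -f "$FILE"
echo "all readers finished"
```

In the real test the readers would run against the ioband-backed partitions and their throughput would be compared per priority level, as in the numbers below.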

Following are the results.

Vanilla CFQ
===========
set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s
578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s
578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s

set2
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s
578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s
578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s
578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s

with dm-ioband
==============
ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s
578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s
578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s
578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s

set2
---
prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s
578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s
578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s
578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s
578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s
578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s
578867200 bytes (579 MB) copied, 139.986 s, 4.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s

Note: In vanilla CFQ, the prio 0 reader got more than 350% of the prio 7
      reader's bandwidth. With dm-ioband this ratio changed to less than
      200%.

      I will run more tests, but this shows how the notion of priority
      within a group changes if we implement throttling at a higher layer
      and don't keep it with CFQ.

      The second thing that strikes me is that I divided the disk 50% each
      between readers and writers, and in that case I would expect
      protection for the writers and expect the writers to finish fast. But
      the writers have been slowed down a lot, and it also kills overall
      disk throughput. I think it probably became seek bound.

      The moment I get more time, I will run some timed fio tests and look
      at how the disk performed overall and how bandwidth was distributed
      within and between groups.

> 
> > Especially the IO throttling patch was very bad in terms of prio within
> > a group, where throttling treated everyone equally and the difference
> > between process prios disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control will most likely change the ratio in which reads
> > and writes are dispatched to the disk within a group. That has so far
> > been decided by the IO scheduler, but with higher level groups doing
> > throttling and possibly buffering the bios and releasing them later, they
> > will have to come up with their own policy on what proportion of reads
> > and writes should be dispatched. In the case of IO scheduler based
> > control, all the queuing takes place at the IO scheduler, which still
> > retains control of the ratio in which reads and writes are dispatched.
> 
> I don't think it is a concern. The current implementation of dm-ioband
> handles sync and async IO requests separately, and the backlogged IOs are
> released in order of arrival if both sync and async requests are
> backlogged.

At least the version of dm-ioband I have is not producing the desired
results. See above.

Is there a newer version? I will run some tests on that too. But I think
you will again run into the same issue: you decide the ratio of reads vs
writes within the group, and as I change the IO scheduler the results will
vary.

So at this point I can't see how you can solve the read vs write ratio
issue at a higher layer without changing the behavior of the underlying
IO scheduler.

> 
> > Summary
> > =======
> > 
> > - An IO scheduler based IO controller can provide better latencies,
> >   stronger isolation between groups, time based fairness, and will not
> >   interfere with IO scheduler policies like class, ioprio and
> >   reader vs writer handling.
> > 
> >   But it cannot guarantee fairness at higher level logical devices.
> >   Especially in the case of max bw control, leaf node control does not
> >   sound like the most appropriate thing.
> > 
> > - IO throttling provides max bw control in terms of an absolute rate. It
> >   has the advantage that it can provide control at higher level logical
> >   devices and also control buffered writes without needing an additional
> >   co-mounted controller.
> > 
> >   But it does only max bw control and not proportional control, so one
> >   might not be using resources optimally. It loses the sense of task prio
> >   and class within the group, as any task within the group can be
> >   throttled. Because throttling does not kick in till you hit the max bw
> >   limit, it will find it hard to provide the same latencies as IO
> >   scheduler based control.
> > 
> > - dm-ioband also has the advantage that it can provide fairness at
> >   higher level logical devices.
> > 
> >   But fairness is provided only in terms of size of IO or number of IOs;
> >   there is no time based fairness. It is very throughput oriented and
> >   does not throttle a high speed group if another group is running a slow
> >   random reader. This results in bad latencies for the random reader
> >   group and weaker isolation between groups.
> 
> A new policy can be added to dm-ioband. Actually, the range-bw policy,
> which provides min and max bandwidth control, does time based
> throttling. Moreover, there is room for improvement in the existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
> 
> >   Also it does not provide fairness if a group is not continuously
> >   backlogged. So if one is running 1-2 dd/sequential readers in the
> >   group, one does not get fairness until the workload is increased to the
> >   point where the group becomes continuously backlogged. This also
> >   results in poor latencies and limited fairness.
> 
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.

But this has following undesired results.

- A slow moving group does not get reduced latencies. For example, random
  readers in the slow moving group get no isolation and will continue to
  see higher max latencies.

- A single sequential reader in one group does not get its fair share, and
  we might be pushing buffered writes in the other group thinking that we
  are getting better throughput. But the fact is that we are eating away
  the reader's share in group 1 and giving it to the writers in group 2. I
  also showed that we did not necessarily improve the overall throughput of
  the system by doing so (because it increases the number of seeks).

  I had sent you a mail to show that.

http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html

 But you changed the test case to run 4 readers in a single group to show
 that throughput does not decrease. Please don't change test cases. With 4
 sequential readers in the group, the group is continuously backlogged and
 you don't steal bandwidth from the slow moving group. So in that mail I
 was not even discussing the scenario where you don't steal bandwidth from
 the other group.

 I specifically created one slow moving group with one reader so that we
 end up stealing bandwidth from the slow moving group, to show that we did
 not achieve higher overall throughput by stealing the BW, and at the same
 time we did not get fairness for the single reader and observed decreasing
 throughput for the single reader as the number of writers in the other
 group increased.

Thanks
Vivek

>  
> > At this point of time it does not look like a single IO controller can
> > cover all the scenarios/requirements. This means a few things to me.
> > 
> > - Drop some of the requirements and go with one implementation which meets
> >   those reduced set of requirements.
> >
> > - Have more than one IO controller implementation in the kernel. One for
> >   lower level control for better latencies, stronger isolation and optimal
> >   resource usage, and the other for fairness at higher level logical devices
> >   and max bandwidth control.
> > 
> >   And let user decide which one to use based on his/her needs. 
> > 
> > - Come up with more intelligent way of doing IO control where single
> >   controller covers all the cases.
> > 
> > At this point of time, I am more inclined towards option 2 of having more
> > than one implementation in the kernel. :-) (Until and unless we can
> > brainstorm and come up with ideas to make option 3 happen).
> >
> > > It would be great if we discuss our plans on the mailing list, so we
> > > can get early feedback from everyone.
> >  
> > This is what comes to my mind so far. Please add to the list if I have missed
> > some points. Also correct me if I am wrong about the pros/cons of the
> > approaches.
> >
> > Thoughts/ideas/opinions are welcome...
> > 
> > Thanks
> > Vivek
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-09-29 14:10         ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-29 14:10 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew,
	yoshikawa.takuya, jmoyer, nauman, mingo, m-ikeda, riel, lizf,
	fchecconi, s-uchida, containers, linux-kernel, akpm,
	righi.andrea, torvalds

On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and all,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> 
> > > We are starting from a point where there is no cgroup based IO
> > > scheduling in the kernel. And it is probably not reasonable to satisfy
> > > all IO scheduling related requirements in one patch set. We can start
> > > with something simple, and build on top of that. So a very simple
> > > patch set that enables cgroup based proportional scheduling for CFQ
> > > seems like the way to go at this point.
> > 
> > Sure, we can start with CFQ only. But a bigger question we need to answer
> > is that is CFQ the right place to solve the issue? Jens, do you think 
> > that CFQ is the right place to solve the problem?
> > 
> > Andrew seems to favor a high level approach so that IO schedulers are less
> > complex and we can provide fairness at high level logical devices also. 
> 
> I'm not in favor of expansion of CFQ, because some enterprise storages
> are better performed with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage system. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specific about Namuman's scheduler design.
> 

The new design is essentially the old design. Except the fact that
suggestion is that in the first step instead of covering all the 4 IO
schedulers, first cover only CFQ and then later others. 

So providing fairness for NOOP is not an issue. Even if we introduce new
IO schedulers down the line, I can't think of a reason why can't we cover
that too with common layer.

> > I will again try to summarize my understanding so far about the pros/cons
> > of each approach and then we can take the discussion forward.
> 
> Good summary. Thanks for your work.
> 
> > Fairness in terms of size of IO or disk time used
> > =================================================
> > On a seeky media, fairness in terms of disk time can get us better results
> > instead fairness interms of size of IO or number of IO.
> > 
> > If we implement some kind of time based solution at higher layer, then 
> > that higher layer should know who used how much of time each group used. We
> > can probably do some kind of timestamping in bio to get a sense when did it
> > get into disk and when did it finish. But on a multi queue hardware there
> > can be multiple requests in the disk either from same queue or from differnet
> > queues and with pure timestamping based apparoch, so far I could not think
> > how at high level we will get an idea who used how much of time.
> 
> IIUC, could the overlap time be calculated from time-stamp on a multi
> queue hardware?

So far could not think of anything clean. Do you have something in mind.

I was thinking that elevator layer will do the merge of bios. So IO
scheduler/elevator can time stamp the first bio in the request as it goes
into the disk and again timestamp with finish time once request finishes.

This way higher layer can get an idea how much disk time a group of bios
used. But on multi queue, if we dispatch say 4 requests from same queue,
then time accounting becomes an issue.

Consider following where four requests rq1, rq2, rq3 and rq4 are
dispatched to disk at time t0, t1, t2 and t3 respectively and these
requests finish at time t4, t5, t6 and t7. For sake of simlicity assume
time elapsed between each of milestones is t. Also assume that all these
requests are from same queue/group.

        t0   t1   t2   t3  t4   t5   t6   t7
        rq1  rq2  rq3 rq4  rq1  rq2  rq3 rq4

Now higher layer will think that time consumed by group is:

(t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t

But the time elapsed is only 7t.

Secondly if a different group is running only single sequential reader,
there CFQ will be driving queue depth of 1 and time will not be running
faster and this inaccuracy in accounting will lead to unfair share between
groups.

So we need something better to get a sense which group used how much of
disk time.

>  
> > So this is the first point of contention that how do we want to provide
> > fairness. In terms of disk time used or in terms of size of IO/number of
> > IO.
> >
> > Max bandwidth Controller or Proportional bandwidth controller
> > =============================================================
> > What is our primary requirement here? A weight based proportional
> > bandwidth controller where we can use the resources optimally and any
> > kind of throttling kicks in only if there is contention for the disk.
> > 
> > Or we want max bandwidth control where a group is not allowed to use the
> > disk even if disk is free. 
> > 
> > Or we need both? I would think that at some point of time we will need
> > both but we can start with proportional bandwidth control first.
> 
> How about making throttling policy be user selectable like the IO
> scheduler and putting it in the higher layer? So we could support
> all of policies (time-based, size-based and rate limiting). There
> seems not to only one solution which satisfies all users. But I agree
> with starting with proportional bandwidth control first. 
> 

What are the cases where time based policy does not work and size based
policy works better and user would choose size based policy and not timed
based one?

I am not against implementing things in higher layer as long as we can
ensure tight control on latencies, strong isolation between groups and
not break CFQ's class and ioprio model with-in group.

> BTW, I will start to reimplement dm-ioband into block layer.

Can you elaborate little bit on this?

> 
> > Fairness for higher level logical devices
> > =========================================
> > Do we want good fairness numbers for higher level logical devices also
> > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > at leaf nodes can help us use the resources optimally and in the process
> > we can get fairness at higher level also in many of the cases.
> 
> We should also take care of block devices which provide their own
> make_request_fn() and not use a IO scheduler. We can't use the leaf
> nodes approach to such devices.
> 

I am not sure how big an issue is this. This can be easily solved by
making use of NOOP scheduler by these devices. What are the reasons for
these devices to not use even noop? 

> > But do we want strict fairness numbers on higher level logical devices
> > even if it means sub-optimal usage of unerlying phsical devices?
> > 
> > I think that for proportinal bandwidth control, it should be ok to provide
> > fairness at higher level logical device but for max bandwidth control it
> > might make more sense to provide fairness at higher level. Consider a
> > case where from a striped device a customer wants to limit a group to
> > 30MB/s and in case of leaf node control, if every leaf node provides
> > 30MB/s, it might accumulate to much more than specified rate at logical
> > device.
> >
> > Latency Control and strong isolation between groups
> > ===================================================
> > Do we want a good isolation between groups and better latencies and
> > stronger isolation between groups?
> > 
> > I think if problem is solved at IO scheduler level, we can achieve better
> > latency control and hence stronger isolation between groups.
> > 
> > Higher level solutions should find it hard to provide same kind of latency
> > control and isolation between groups as IO scheduler based solution.
> 
> Why do you think that the higher level solution is hard to provide it? 
> I think that it is a matter of how to implement throttling policy.
> 

So far, in both the dm-ioband and IO throttling solutions, I have seen
the higher layer implement some kind of leaky bucket/token bucket
algorithm, which inherently allows IO from all the competing groups
until they run out of tokens; the groups are then made to wait till
fresh tokens are issued.

That means that, most of the time, the IO scheduler will see requests
from more than one group at the same time, and that will be the source
of weak isolation between groups.

Consider the following simple example. Assume there are two groups: one
contains 16 random readers and the other contains 1 random reader.

		G1	G2
	       16RR	1RR 

Now it might happen that the IO scheduler sees requests from all the 17
random readers at the same time. (Throttling will probably kick in later,
because you would like to give a group a nice slice of 100ms, otherwise
sequential readers will suffer a lot and the disk will become seek bound.)

So CFQ will dispatch requests (at least one) from each of the 16 random
readers first and then from the 1 random reader in group 2, and this
increases the max latency for the application in group 2 and provides
weak isolation.

There will also be additional issues with the CFQ preemption logic. CFQ
will have no knowledge of groups and will do cross-group preemptions.
For example, if a metadata request comes in group1, it will preempt
whatever queue is being served in other groups. So somebody doing
"find . *" or "cat <small files>" in one group will keep on preempting a
sequential reader in the other group. Again, this will probably lead to
higher max latencies.

Note that even if CFQ does not enable idling on random readers and
expires a queue after a single dispatch, the seek time between queues
can be significant. Similarly, if instead of 16 random readers we had 16
random synchronous writers, we would have the same seek time issue, and
writers often dump bigger requests, which also adds to latency.

This latency issue can be solved if we dispatch requests from only one
group for a certain period of time and then move to the next group
(something like what the common layer is doing).

If we go for only a single group dispatching requests, then we shall
have to implement some of the preemption semantics in the higher layer
as well, because in certain cases we want to preempt across groups, like
an RT task group preempting a non-RT task group.

Once we go deeper into implementation, I think we will find more issues.

> > Fairness for buffered writes
> > ============================
> > Doing io control at any place below page cache has disadvantage that page
> > cache might not dispatch more writes from higher weight group hence higher
> > weight group might not see more IO done. Andrew says that we don't have
> > a solution to this problem in kernel and he would like to see it handled
> > properly.
> > 
> > Only way to solve this seems to be to slow down the writers before they
> > write into page cache. IO throttling patch handled it by slowing down 
> > writer if it crossed max specified rate. Other suggestions have come in
> > the form of dirty_ratio per memory cgroup or a separate cgroup controller
> > altogether where some kind of per group write limit can be specified.
> > 
> > So if solution is implemented at IO scheduler layer or at device mapper
> > layer, both shall have to rely on another controller to be co-mounted
> > to handle buffered writes properly.
> >
> > Fairness with-in group
> > ======================
> > One of the issues with higher level controller is that how to do fair
> > throttling so that fairness with-in group is not impacted. Especially
> > the case of making sure that we don't break the notion of ioprio of the
> > processes with-in group.
> 
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here are the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> 
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.

Ok, I re-ran that test. Previously the default io_limit value was 192,
and now I have set it to 256 as you suggested. I still see the writer
starving the reader. I removed "conv=fdatasync" from the writer so that
the writer does pure buffered writes.

With vanilla CFQ
----------------
reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s

with dm-ioband default io_limit=192
-----------------------------------
writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s
reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

with dm-ioband default io_limit=256
-----------------------------------
reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100
ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100

Notice that with vanilla CFQ the reader takes about 10 seconds to finish,
while with dm-ioband it takes more than 40 seconds. So the writer is
still starving the reader with both io_limit 192 and io_limit 256.

On top of that, can you please give some details on how increasing the
buffered queue length reduces the impact of writers?

IO Prio issue
--------------
I ran another test where two ioband devices of weight 100 each were
created on two partitions. In the first group, 4 readers were launched:
three readers of class BE and prio 7, and a fourth of class BE and
prio 0. In group2, I launched a buffered writer.

One would expect that the prio0 reader gets more bandwidth than the
prio7 readers, and that the prio7 readers get more or less the same BW.
That is not happening here. Look how vanilla CFQ provides much more
bandwidth to the prio0 reader than to the prio7 readers, and how putting
them in a group reduces the difference between the prio0 and prio7
readers.

Following are the results.

Vanilla CFQ
===========
set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s
578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s
578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s

set2
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s
578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s
578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s
578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s

with dm-ioband
==============
ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s
578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s
578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s
578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s

set2
---
prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s
578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s
578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s
578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s
578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s
578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s
578867200 bytes (579 MB) copied, 139.986 s, 4.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s

Note: In vanilla CFQ, the prio0 reader got more than 350% of a prio7
      reader's BW. With dm-ioband this ratio changed to less than 200%.

      I will run more tests, but this shows how the notion of priority
      within a group changes if we implement throttling at a higher
      layer and don't keep it with CFQ.

     The second thing which strikes me is that I divided the disk 50%
     each between readers and writers, and in that case one would expect
     protection for the writers and expect the writers to finish fast.
     But the writers have been slowed down a lot, and it also kills
     overall disk throughput. I think it probably became seek bound.

     I think the moment I get more time, I will run some timed fio tests
     and look at how the disk performed overall and how bandwidth was
     distributed within and between groups.

> 
> > Especially io throttling patch was very bad in terms of prio with-in 
> > group where throttling treated everyone equally and difference between
> > process prio disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control most likely will change the ratio in which reads
> > and writes are dispatched to disk with-in group. It used to be decided
> > by IO scheduler so far but with higher level groups doing throttling and
> > possibly buffering the bios and releasing them later, they will have to
> > come up with their own policy on in what proportion reads and writes
> > should be dispatched. In case of IO scheduler based control, all the
> > queuing takes place at IO scheduler and it still retains control of
> > in what ratio reads and writes should be dispatched.
> 
> I don't think it is a concern. The current implementation of dm-ioband
> is that sync/async IO requests are handled separately and the
> backlogged IOs are released according to the order of arrival if both
> sync and async requests are backlogged.

At least the version of dm-ioband I have is not producing the desired
results. See above.

Is there a newer version? I will run some tests on that too. But I think
you will again run into the same issue, where you decide the ratio of
reads vs writes within a group, and as I change the IO scheduler the
results will vary.

So at this point of time I can't see how you can solve the read vs write
ratio issue at a higher layer without changing the behavior of the
underlying IO scheduler.

> 
> > Summary
> > =======
> > 
> > - An io scheduler based io controller can provide better latencies,
> >   stronger isolation between groups, time based fairness and will not
> >   interfere with io schedulers policies like class, ioprio and
> >   reader vs writer issues.
> > 
> >   But it cannot guarantee fairness at higher level logical devices.
> >   Especially in case of max bw control, leaf node control does not sound
> >   to be the most appropriate thing.
> > 
> > - IO throttling provides max bw control in terms of absolute rate. It has
> >   the advantage that it can provide control at higher level logical device
> >   and also control buffered writes without need of additional controller
> >   co-mounted.
> > 
> >   But it does only max bw control and not proportional control, so one might
> >   not be using resources optimally. It loses the sense of task prio and class
> >   with-in group as any of the task can be throttled with-in group. Because
> >   throttling does not kick in till you hit the max bw limit, it should find
> >   it hard to provide same latencies as io scheduler based control.
> > 
> > - dm-ioband also has the advantage that it can provide fairness at higher
> >   level logical devices.
> > 
> >   But, fairness is provided only in terms of size of IO or number of IO.
> >   No time based fairness. It is very throughput oriented and does not 
> >   throttle high speed group if other group is running slow random reader.
> >   This results in bad latencies for the random reader group and weaker
> >   isolation between groups.
> 
> A new policy can be added to dm-ioband. Actually, range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling. Moreover there is room for improvement for existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
> 
> >   Also it does not provide fairness if a group is not continuously
> >   backlogged. So if one is running 1-2 dd/sequential readers in the group,
> >   one does not get fairness until workload is increased to a point where
> >   group becomes continuously backlogged. This also results in poor
> >   latencies and limited fairness.
> 
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.

But this has the following undesired results.

- The slow moving group does not get reduced latencies. For example, random
  readers in the slow moving group get no isolation and will continue to see
  higher max latencies.

- A single sequential reader in one group does not get its fair share, and
  we might be pushing buffered writes in the other group thinking that we
  are getting better throughput. But the fact is that we are eating away the
  reader's share in group1 and giving it to the writers in group2. I also
  showed that we did not necessarily improve the overall throughput of the
  system by doing so (because it increases the number of seeks).

  I had sent you a mail to show that.

http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html

 But you changed the test case to run 4 readers in a single group to show
 that its throughput does not decrease. Please don't change test cases. In
 the case of 4 sequential readers in the group, the group is continuously
 backlogged and you don't steal bandwidth from the slow moving group. So in
 that mail I was not even discussing the scenario where you don't steal
 bandwidth from the other group.

 I specifically created one slow moving group with one reader so that we end
 up stealing bandwidth from the slow moving group, to show that we did not
 achieve higher overall throughput by stealing the BW, and that at the same
 time we did not get fairness for the single reader and observed decreasing
 throughput for the single reader as the number of writers in the other
 group increased.

Thanks
Vivek

>  
> > At this point of time it does not look like a single IO controller can
> > cover all the scenarios/requirements. This means a few things to me.
> > 
> > - Drop some of the requirements and go with one implementation which meets
> >   those reduced set of requirements.
> >
> > - Have more than one IO controller implementation in kernel. One for lower
> >   level control for better latencies, stronger isolation and optimal resource
> >   usage and other one for fairness at higher level logical devices and max
> >   bandwidth control. 
> > 
> >   And let user decide which one to use based on his/her needs. 
> > 
> > - Come up with more intelligent way of doing IO control where single
> >   controller covers all the cases.
> > 
> > At this point of time, I am more inclined towards option 2 of having more
> > than one implementation in kernel. :-) (Until and unless we can brainstorm
> > and come up with ideas to make option 3 happen).
> >
> > > It would be great if we discuss our plans on the mailing list, so we
> > > can get early feedback from everyone.
> >  
> > This is what comes to my mind so far. Please add to the list if I have missed
> > some points. Also correct me if I am wrong about the pros/cons of the
> > approaches.
> >
> > Thoughts/ideas/opinions are welcome...
> > 
> > Thanks
> > Vivek
> 
> Thanks,
> Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]       ` <20090929.185653.183056711.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-29 10:49         ` Takuya Yoshikawa
  2009-09-29 14:10         ` Vivek Goyal
  2009-09-30  3:11         ` Vivek Goyal
  2 siblings, 0 replies; 349+ messages in thread
From: Takuya Yoshikawa @ 2009-09-29 10:49 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi,

Ryo Tsuruta wrote:
> Hi Vivek and all,
> 
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> 
>>> We are starting from a point where there is no cgroup based IO
>>> scheduling in the kernel. And it is probably not reasonable to satisfy
>>> all IO scheduling related requirements in one patch set. We can start
>>> with something simple, and build on top of that. So a very simple
>>> patch set that enables cgroup based proportional scheduling for CFQ
>>> seems like the way to go at this point.
>> Sure, we can start with CFQ only. But a bigger question we need to answer
>> is that is CFQ the right place to solve the issue? Jens, do you think 
>> that CFQ is the right place to solve the problem?
>>
>> Andrew seems to favor a high level approach so that IO schedulers are less
>> complex and we can provide fairness at high level logical devices also. 
> 
> I'm not in favor of expansion of CFQ, because some enterprise storages
> are better performed with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage system. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specifics about Nauman's scheduler design.

Nauman said "cgroup based proportional scheduling for CFQ", and we would not
need to expand much of CFQ itself; is that right, Nauman?

If so, we can reuse the io controller for new schedulers similar to CFQ.

I do not know how important it is to consider which scheduler is the current
enterprise storages' favorite. If we introduce an io controller, the io
pattern to disks will change; in that case there is no guarantee that NOOP
with an io controller will work better than CFQ with an io controller.

Of course, an io controller for NOOP may be better.

Thanks,
Takuya Yoshikawa


> 
>> I will again try to summarize my understanding so far about the pros/cons
>> of each approach and then we can take the discussion forward.
> 
> Good summary. Thanks for your work.
> 
>> Fairness in terms of size of IO or disk time used
>> =================================================
>> On a seeky media, fairness in terms of disk time can get us better results
>> instead of fairness in terms of size of IO or number of IO.
>>
>> If we implement some kind of time based solution at a higher layer, then
>> that higher layer should know how much time each group used. We can
>> probably do some kind of timestamping in the bio to get a sense of when
>> it got into the disk and when it finished. But on multi-queue hardware
>> there can be multiple requests in the disk, either from the same queue or
>> from different queues, and with a pure timestamping based approach, so
>> far I could not think of how, at a high level, we will get an idea of who
>> used how much time.
> 
> IIUC, could the overlap time be calculated from timestamps on multi-queue
> hardware?
>  
>> So this is the first point of contention that how do we want to provide
>> fairness. In terms of disk time used or in terms of size of IO/number of
>> IO.
>>
>> Max bandwidth Controller or Proportional bandwidth controller
>> =============================================================
>> What is our primary requirement here? A weight based proportional
>> bandwidth controller where we can use the resources optimally and any
>> kind of throttling kicks in only if there is contention for the disk.
>>
>> Or we want max bandwidth control where a group is not allowed to use the
>> disk even if disk is free. 
>>
>> Or we need both? I would think that at some point of time we will need
>> both but we can start with proportional bandwidth control first.
> 
> How about making throttling policy be user selectable like the IO
> scheduler and putting it in the higher layer? So we could support
> all of the policies (time-based, size-based and rate limiting). There
> seems to be no single solution that satisfies all users. But I agree
> with starting with proportional bandwidth control first. 
> 
> BTW, I will start to reimplement dm-ioband into block layer.
> 
>> Fairness for higher level logical devices
>> =========================================
>> Do we want good fairness numbers for higher level logical devices also
>> or it is sufficient to provide fairness at leaf nodes. Providing fairness
>> at leaf nodes can help us use the resources optimally and in the process
>> we can get fairness at higher level also in many of the cases.
> 
> We should also take care of block devices which provide their own
> make_request_fn() and do not use an IO scheduler. We can't use the
> leaf node approach for such devices.
> 
>> But do we want strict fairness numbers on higher level logical devices
>> even if it means sub-optimal usage of underlying physical devices?
>>
>> I think that for proportional bandwidth control, it should be ok to provide
>> fairness at higher level logical device but for max bandwidth control it
>> might make more sense to provide fairness at higher level. Consider a
>> case where from a striped device a customer wants to limit a group to
>> 30MB/s and in case of leaf node control, if every leaf node provides
>> 30MB/s, it might accumulate to much more than specified rate at logical
>> device.
>>
>> Latency Control and strong isolation between groups
>> ===================================================
>> Do we want a good isolation between groups and better latencies and
>> stronger isolation between groups?
>>
>> I think if problem is solved at IO scheduler level, we can achieve better
>> latency control and hence stronger isolation between groups.
>>
>> Higher level solutions should find it hard to provide same kind of latency
>> control and isolation between groups as IO scheduler based solution.
> 
> Why do you think that it is hard for a higher level solution to provide this?
> I think that it is a matter of how to implement throttling policy.
> 
>> Fairness for buffered writes
>> ============================
>> Doing io control at any place below page cache has disadvantage that page
>> cache might not dispatch more writes from higher weight group hence higher
>> weight group might not see more IO done. Andrew says that we don't have
>> a solution to this problem in kernel and he would like to see it handled
>> properly.
>>
>> Only way to solve this seems to be to slow down the writers before they
>> write into page cache. IO throttling patch handled it by slowing down 
>> writer if it crossed max specified rate. Other suggestions have come in
>> the form of dirty_ratio per memory cgroup or a separate cgroup controller
>> altogether where some kind of per group write limit can be specified.
>>
>> So if solution is implemented at IO scheduler layer or at device mapper
>> layer, both shall have to rely on another controller to be co-mounted
>> to handle buffered writes properly.
>>
>> Fairness with-in group
>> ======================
>> One of the issues with higher level controller is that how to do fair
>> throttling so that fairness with-in group is not impacted. Especially
>> the case of making sure that we don't break the notion of ioprio of the
>> processes with-in group.
> 
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here are the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> 
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.
> 
>> Especially io throttling patch was very bad in terms of prio with-in 
>> group where throttling treated everyone equally and difference between
>> process prio disappeared.
>>
>> Reads Vs Writes
>> ===============
>> A higher level control most likely will change the ratio in which reads
>> and writes are dispatched to disk with-in group. It used to be decided
>> by IO scheduler so far but with higher level groups doing throttling and
>> possibly buffering the bios and releasing them later, they will have to
>> come up with their own policy on in what proportion reads and writes
>> should be dispatched. In case of IO scheduler based control, all the
>> queuing takes place at IO scheduler and it still retains control of
>> in what ratio reads and writes should be dispatched.
> 
> I don't think it is a concern. The current implementation of dm-ioband
> is that sync/async IO requests are handled separately and the
> backlogged IOs are released according to the order of arrival if both
> sync and async requests are backlogged.
> 
>> Summary
>> =======
>>
>> - An io scheduler based io controller can provide better latencies,
>>   stronger isolation between groups, time based fairness and will not
>>   interfere with io schedulers policies like class, ioprio and
>>   reader vs writer issues.
>>
>>   But it cannot guarantee fairness at higher level logical devices.
>>   Especially in case of max bw control, leaf node control does not sound
>>   to be the most appropriate thing.
>>
>> - IO throttling provides max bw control in terms of absolute rate. It has
>>   the advantage that it can provide control at higher level logical device
>>   and also control buffered writes without need of additional controller
>>   co-mounted.
>>
>>   But it does only max bw control and not proportional control, so one might
>>   not be using resources optimally. It loses the sense of task prio and class
>>   with-in group as any of the task can be throttled with-in group. Because
>>   throttling does not kick in till you hit the max bw limit, it should find
>>   it hard to provide same latencies as io scheduler based control.
>>
>> - dm-ioband also has the advantage that it can provide fairness at higher
>>   level logical devices.
>>
>>   But, fairness is provided only in terms of size of IO or number of IO.
>>   No time based fairness. It is very throughput oriented and does not 
>>   throttle high speed group if other group is running slow random reader.
>>   This results in bad latencies for the random reader group and weaker
>>   isolation between groups.
> 
> A new policy can be added to dm-ioband. Actually, range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling. Moreover there is room for improvement for existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
> 
>>   Also it does not provide fairness if a group is not continuously
>>   backlogged. So if one is running 1-2 dd/sequential readers in the group,
>>   one does not get fairness until workload is increased to a point where
>>   group becomes continuously backlogged. This also results in poor
>>   latencies and limited fairness.
> 
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.
>  
>> At this point of time it does not look like a single IO controller can
>> cover all the scenarios/requirements. This means a few things to me.
>>
>> - Drop some of the requirements and go with one implementation which meets
>>   those reduced set of requirements.
>>
>> - Have more than one IO controller implementation in kernel. One for lower
>>   level control for better latencies, stronger isolation and optimal resource
>>   usage and other one for fairness at higher level logical devices and max
>>   bandwidth control. 
>>
>>   And let user decide which one to use based on his/her needs. 
>>
>> - Come up with more intelligent way of doing IO control where single
>>   controller covers all the cases.
>>
>> At this point of time, I am more inclined towards option 2 of having more
>> than one implementation in kernel. :-) (Until and unless we can brainstorm
>> and come up with ideas to make option 3 happen).
>>
>>> It would be great if we discuss our plans on the mailing list, so we
>>> can get early feedback from everyone.
>>  
>> This is what comes to my mind so far. Please add to the list if I have missed
>> some points. Also correct me if I am wrong about the pros/cons of the
>> approaches.
>>
>> Thoughts/ideas/opinions are welcome...
>>
>> Thanks
>> Vivek
> 
> Thanks,
> Ryo Tsuruta
> 

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-29  9:56     ` Ryo Tsuruta
@ 2009-09-29 10:49       ` Takuya Yoshikawa
  2009-09-29 14:10         ` Vivek Goyal
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 349+ messages in thread
From: Takuya Yoshikawa @ 2009-09-29 10:49 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: vgoyal, nauman, linux-kernel, jens.axboe, containers, dm-devel,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds,
	mingo, riel

Hi,

Ryo Tsuruta wrote:
> Hi Vivek and all,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
>> On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> 
>>> We are starting from a point where there is no cgroup based IO
>>> scheduling in the kernel. And it is probably not reasonable to satisfy
>>> all IO scheduling related requirements in one patch set. We can start
>>> with something simple, and build on top of that. So a very simple
>>> patch set that enables cgroup based proportional scheduling for CFQ
>>> seems like the way to go at this point.
>> Sure, we can start with CFQ only. But a bigger question we need to answer
>> is that is CFQ the right place to solve the issue? Jens, do you think 
>> that CFQ is the right place to solve the problem?
>>
>> Andrew seems to favor a high level approach so that IO schedulers are less
>> complex and we can provide fairness at high level logical devices also. 
> 
> I'm not in favor of expansion of CFQ, because some enterprise storages
> are better performed with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage system. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specifics about Nauman's scheduler design.

Nauman said "cgroup based proportional scheduling for CFQ", and we would not
need to expand much of CFQ itself; is that right, Nauman?

If so, we can reuse the io controller for new schedulers similar to CFQ.

I do not know how important it is to consider which scheduler is the current
enterprise storages' favorite. If we introduce an io controller, the io
pattern to disks will change; in that case there is no guarantee that NOOP
with an io controller will work better than CFQ with an io controller.

Of course, an io controller for NOOP may be better.

Thanks,
Takuya Yoshikawa


> 
>> I will again try to summarize my understanding so far about the pros/cons
>> of each approach and then we can take the discussion forward.
> 
> Good summary. Thanks for your work.
> 
>> Fairness in terms of size of IO or disk time used
>> =================================================
>> On a seeky media, fairness in terms of disk time can get us better results
>> instead of fairness in terms of size of IO or number of IO.
>>
>> If we implement some kind of time based solution at a higher layer, then
>> that higher layer should know how much time each group used. We can
>> probably do some kind of timestamping in the bio to get a sense of when
>> it got into the disk and when it finished. But on multi-queue hardware
>> there can be multiple requests in the disk, either from the same queue or
>> from different queues, and with a pure timestamping based approach, so
>> far I could not think of how, at a high level, we will get an idea of who
>> used how much time.
> 
> IIUC, couldn't the overlap time be calculated from the time-stamps even on
> multi-queue hardware?
>  
>> So this is the first point of contention: how do we want to provide
>> fairness? In terms of disk time used, or in terms of size of IO/number of
>> IO.
>>
>> Max bandwidth Controller or Proportional bandwidth controller
>> =============================================================
>> What is our primary requirement here? A weight based proportional
>> bandwidth controller where we can use the resources optimally and any
>> kind of throttling kicks in only if there is contention for the disk.
>>
>> Or we want max bandwidth control where a group is not allowed to use the
>> disk even if disk is free. 
>>
>> Or we need both? I would think that at some point of time we will need
>> both but we can start with proportional bandwidth control first.
> 
> How about making the throttling policy user-selectable, like the IO
> scheduler, and putting it in the higher layer? Then we could support
> all of the policies (time-based, size-based and rate limiting). There
> seems to be no single solution that satisfies all users. But I agree
> with starting with proportional bandwidth control first.
> 
> BTW, I will start to reimplement dm-ioband in the block layer.
> 
>> Fairness for higher level logical devices
>> =========================================
>> Do we want good fairness numbers for higher level logical devices also
>> or is it sufficient to provide fairness at leaf nodes? Providing fairness
>> at leaf nodes can help us use the resources optimally and in the process
>> we can get fairness at higher level also in many of the cases.
> 
> We should also take care of block devices which provide their own
>> make_request_fn() and do not use an IO scheduler. We can't use the
>> leaf-node approach for such devices.
> 
>> But do we want strict fairness numbers on higher level logical devices
>> even if it means sub-optimal usage of underlying physical devices?
>>
>> I think that for proportional bandwidth control it should be ok to provide
>> fairness at the leaf nodes, but for max bandwidth control it
>> might make more sense to enforce the limit at the higher level. Consider a
>> case where a customer wants to limit a group on a striped device to
>> 30MB/s: with leaf node control, if every leaf node allows
>> 30MB/s, it can accumulate to much more than the specified rate at the logical
>> device.
>>
>> Latency Control and strong isolation between groups
>> ===================================================
>> Do we want better latencies and stronger isolation between groups?
>>
>> I think if problem is solved at IO scheduler level, we can achieve better
>> latency control and hence stronger isolation between groups.
>>
>> Higher level solutions should find it hard to provide the same kind of latency
>> control and isolation between groups as an IO scheduler based solution.
> 
> Why do you think it is hard for a higher level solution to provide this?
> I think that it is a matter of how the throttling policy is implemented.
> 
>> Fairness for buffered writes
>> ============================
>> Doing io control at any place below the page cache has the disadvantage that the
>> page cache might not dispatch more writes from a higher weight group, hence the
>> higher weight group might not see more IO done. Andrew says that we don't have
>> a solution to this problem in kernel and he would like to see it handled
>> properly.
>>
>> The only way to solve this seems to be to slow down the writers before they
>> write into the page cache. The IO throttling patch handled it by slowing down the
>> writer if it crossed the max specified rate. Other suggestions have come in
>> the form of a dirty_ratio per memory cgroup, or a separate cgroup controller
>> altogether where some kind of per group write limit can be specified.
>>
>> So if solution is implemented at IO scheduler layer or at device mapper
>> layer, both shall have to rely on another controller to be co-mounted
>> to handle buffered writes properly.
>>
>> Fairness with-in group
>> ======================
>> One of the issues with a higher level controller is how to do fair
>> throttling so that fairness within the group is not impacted, especially
>> making sure that we don't break the notion of ioprio of the
>> processes within the group.
> 
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here are the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> 
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.
> 
>> The io throttling patch especially was very bad in terms of prio within the
>> group, where throttling treated everyone equally and the difference between
>> process prios disappeared.
>>
>> Reads Vs Writes
>> ===============
>> A higher level control will most likely change the ratio in which reads
>> and writes are dispatched to disk within a group. So far this has been decided
>> by the IO scheduler, but with higher level groups doing throttling and
>> possibly buffering the bios and releasing them later, they will have to
>> come up with their own policy on the proportion in which reads and writes
>> should be dispatched. In case of IO scheduler based control, all the
>> queuing takes place at the IO scheduler and it retains control of
>> the ratio in which reads and writes are dispatched.
> 
> I don't think it is a concern. In the current implementation of dm-ioband,
> sync/async IO requests are handled separately, and the
> backlogged IOs are released in order of arrival when both
> sync and async requests are backlogged.
> 
>> Summary
>> =======
>>
>> - An io scheduler based io controller can provide better latencies,
>>   stronger isolation between groups, time based fairness and will not
>>   interfere with io schedulers policies like class, ioprio and
>>   reader vs writer issues.
>>
>>   But it cannot guarantee fairness at higher level logical devices.
>>   Especially in case of max bw control, leaf node control does not sound
>>   to be the most appropriate thing.
>>
>> - IO throttling provides max bw control in terms of absolute rate. It has
>>   the advantage that it can provide control at higher level logical device
>>   and also control buffered writes without need of additional controller
>>   co-mounted.
>>
>>   But it does only max bw control and not proportional control, so one might
>>   not be using resources optimally. It loses the sense of task prio and class
>>   within a group, as any of the tasks can be throttled. Because
>>   throttling does not kick in till you hit the max bw limit, it should find
>>   it hard to provide the same latencies as io scheduler based control.
>>
>> - dm-ioband also has the advantage that it can provide fairness at higher
>>   level logical devices.
>>
>>   But, fairness is provided only in terms of size of IO or number of IO.
>>   No time based fairness. It is very throughput oriented and does not
>>   throttle a high speed group if another group is running a slow random reader.
>>   This results in bad latencies for the random reader group and weaker
>>   isolation between groups.
> 
> A new policy can be added to dm-ioband. Actually, the range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling. Moreover, there is room for improvement in the existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
> 
>>   Also it does not provide fairness if a group is not continuously
>>   backlogged. So if one is running 1-2 dd/sequential readers in the group,
>>   one does not get fairness until workload is increased to a point where
>>   group becomes continuously backlogged. This also results in poor
>>   latencies and limited fairness.
> 
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.
>  
>> At this point of time it does not look like a single IO controller can cover all
>> the scenarios/requirements. This means a few things to me.
>>
>> - Drop some of the requirements and go with one implementation which meets
>>   that reduced set of requirements.
>>
>> - Have more than one IO controller implementation in the kernel. One for lower
>>   level control for better latencies, stronger isolation and optimal resource
>>   usage and other one for fairness at higher level logical devices and max
>>   bandwidth control. 
>>
>>   And let user decide which one to use based on his/her needs. 
>>
>> - Come up with more intelligent way of doing IO control where single
>>   controller covers all the cases.
>>
>> At this point of time, I am more inclined towards option 2 of having more
>> than one implementation in the kernel. :-) (Until and unless we can brainstorm
>> and come up with ideas to make option 3 happen).
>>
>>> It would be great if we discuss our plans on the mailing list, so we
>>> can get early feedback from everyone.
>>  
>> This is what comes to my mind so far. Please add to the list if I have missed
>> some points. Also correct me if I am wrong about the pros/cons of the
>> approaches.
>>
>> Thoughts/ideas/opinions are welcome...
>>
>> Thanks
>> Vivek
> 
> Thanks,
> Ryo Tsuruta
> 


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-29  3:22     ` Vivek Goyal
  (?)
  (?)
@ 2009-09-29  9:56     ` Ryo Tsuruta
  2009-09-29 10:49       ` Takuya Yoshikawa
                         ` (3 more replies)
  -1 siblings, 4 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-09-29  9:56 UTC (permalink / raw)
  To: vgoyal
  Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah,
	lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka,
	guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

Hi Vivek and all,

Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:

> > We are starting from a point where there is no cgroup based IO
> > scheduling in the kernel. And it is probably not reasonable to satisfy
> > all IO scheduling related requirements in one patch set. We can start
> > with something simple, and build on top of that. So a very simple
> > patch set that enables cgroup based proportional scheduling for CFQ
> > seems like the way to go at this point.
> 
> Sure, we can start with CFQ only. But a bigger question we need to answer
> is whether CFQ is the right place to solve the issue. Jens, do you think
> that CFQ is the right place to solve the problem?
> 
> Andrew seems to favor a high level approach so that IO schedulers are less
> complex and we can provide fairness at high level logical devices also. 

I'm not in favor of expanding CFQ, because some enterprise storages
perform better with NOOP than with CFQ, and I think bandwidth
control is needed much more for such storage systems. Would it be easy to
support other IO schedulers, even if a new IO scheduler is introduced?
I would like to know a bit more about the specifics of Nauman's scheduler design.

> I will again try to summarize my understanding so far about the pros/cons
> of each approach and then we can take the discussion forward.

Good summary. Thanks for your work.

> Fairness in terms of size of IO or disk time used
> =================================================
> On seeky media, fairness in terms of disk time can get us better results
> than fairness in terms of size of IO or number of IO.
> 
> If we implement some kind of time based solution at a higher layer, then
> that higher layer should know how much time each group used. We
> can probably do some kind of timestamping in the bio to get a sense of when it
> got into the disk and when it finished. But on multi queue hardware there
> can be multiple requests in the disk, either from the same queue or from different
> queues, and with a pure timestamping based approach, so far I could not think of
> how, at a high level, we will get an idea of who used how much time.

IIUC, couldn't the overlap time be calculated from the time-stamps even on
multi-queue hardware?
 
> So this is the first point of contention: how do we want to provide
> fairness? In terms of disk time used, or in terms of size of IO/number of
> IO.
>
> Max bandwidth Controller or Proportional bandwidth controller
> =============================================================
> What is our primary requirement here? A weight based proportional
> bandwidth controller where we can use the resources optimally and any
> kind of throttling kicks in only if there is contention for the disk.
> 
> Or we want max bandwidth control where a group is not allowed to use the
> disk even if disk is free. 
> 
> Or we need both? I would think that at some point of time we will need
> both but we can start with proportional bandwidth control first.

How about making the throttling policy user-selectable, like the IO
scheduler, and putting it in the higher layer? Then we could support
all of the policies (time-based, size-based and rate limiting). There
seems to be no single solution that satisfies all users. But I agree
with starting with proportional bandwidth control first.

BTW, I will start to reimplement dm-ioband in the block layer.

> Fairness for higher level logical devices
> =========================================
> Do we want good fairness numbers for higher level logical devices also
> or is it sufficient to provide fairness at leaf nodes? Providing fairness
> at leaf nodes can help us use the resources optimally and in the process
> we can get fairness at higher level also in many of the cases.

We should also take care of block devices which provide their own
> make_request_fn() and do not use an IO scheduler. We can't use the
> leaf-node approach for such devices.

> But do we want strict fairness numbers on higher level logical devices
> even if it means sub-optimal usage of underlying physical devices?
> 
> I think that for proportional bandwidth control it should be ok to provide
> fairness at the leaf nodes, but for max bandwidth control it
> might make more sense to enforce the limit at the higher level. Consider a
> case where a customer wants to limit a group on a striped device to
> 30MB/s: with leaf node control, if every leaf node allows
> 30MB/s, it can accumulate to much more than the specified rate at the logical
> device.
>
> Latency Control and strong isolation between groups
> ===================================================
> Do we want better latencies and stronger isolation between groups?
> 
> I think if problem is solved at IO scheduler level, we can achieve better
> latency control and hence stronger isolation between groups.
> 
> Higher level solutions should find it hard to provide the same kind of latency
> control and isolation between groups as an IO scheduler based solution.

Why do you think it is hard for a higher level solution to provide this?
I think that it is a matter of how the throttling policy is implemented.

> Fairness for buffered writes
> ============================
> Doing io control at any place below the page cache has the disadvantage that the
> page cache might not dispatch more writes from a higher weight group, hence the
> higher weight group might not see more IO done. Andrew says that we don't have
> a solution to this problem in kernel and he would like to see it handled
> properly.
> 
> The only way to solve this seems to be to slow down the writers before they
> write into the page cache. The IO throttling patch handled it by slowing down the
> writer if it crossed the max specified rate. Other suggestions have come in
> the form of a dirty_ratio per memory cgroup, or a separate cgroup controller
> altogether where some kind of per group write limit can be specified.
> 
> So if solution is implemented at IO scheduler layer or at device mapper
> layer, both shall have to rely on another controller to be co-mounted
> to handle buffered writes properly.
>
> Fairness with-in group
> ======================
> One of the issues with a higher level controller is how to do fair
> throttling so that fairness within the group is not impacted, especially
> making sure that we don't break the notion of ioprio of the
> processes within the group.

I ran your test script to confirm that the notion of ioprio was not
broken by dm-ioband. Here are the results of the test.
https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html

I think that the time period during which dm-ioband holds IO requests
for throttling would be too short to break the notion of ioprio.

> The io throttling patch especially was very bad in terms of prio within the
> group, where throttling treated everyone equally and the difference between
> process prios disappeared.
>
> Reads Vs Writes
> ===============
> A higher level control will most likely change the ratio in which reads
> and writes are dispatched to disk within a group. So far this has been decided
> by the IO scheduler, but with higher level groups doing throttling and
> possibly buffering the bios and releasing them later, they will have to
> come up with their own policy on the proportion in which reads and writes
> should be dispatched. In case of IO scheduler based control, all the
> queuing takes place at the IO scheduler and it retains control of
> the ratio in which reads and writes are dispatched.

I don't think it is a concern. In the current implementation of dm-ioband,
sync/async IO requests are handled separately, and the
backlogged IOs are released in order of arrival when both
sync and async requests are backlogged.

> Summary
> =======
> 
> - An io scheduler based io controller can provide better latencies,
>   stronger isolation between groups, time based fairness and will not
>   interfere with io schedulers policies like class, ioprio and
>   reader vs writer issues.
> 
>   But it cannot guarantee fairness at higher level logical devices.
>   Especially in case of max bw control, leaf node control does not sound
>   to be the most appropriate thing.
> 
> - IO throttling provides max bw control in terms of absolute rate. It has
>   the advantage that it can provide control at higher level logical device
>   and also control buffered writes without need of additional controller
>   co-mounted.
> 
>   But it does only max bw control and not proportional control, so one might
>   not be using resources optimally. It loses the sense of task prio and class
>   within a group, as any of the tasks can be throttled. Because
>   throttling does not kick in till you hit the max bw limit, it should find
>   it hard to provide the same latencies as io scheduler based control.
> 
> - dm-ioband also has the advantage that it can provide fairness at higher
>   level logical devices.
> 
>   But, fairness is provided only in terms of size of IO or number of IO.
>   No time based fairness. It is very throughput oriented and does not
>   throttle a high speed group if another group is running a slow random reader.
>   This results in bad latencies for the random reader group and weaker
>   isolation between groups.

A new policy can be added to dm-ioband. Actually, the range-bw policy,
which provides min and max bandwidth control, does time-based
throttling. Moreover, there is room for improvement in the existing
policies. The write-starve-read issue you pointed out will be solved
soon.

>   Also it does not provide fairness if a group is not continuously
>   backlogged. So if one is running 1-2 dd/sequential readers in the group,
>   one does not get fairness until the workload is increased to a point where
>   the group becomes continuously backlogged. This also results in poor
>   latencies and limited fairness.

This is intended to efficiently use the bandwidth of the underlying
devices when IO load is low.
 
> At this point of time it does not look like a single IO controller can cover
> all the scenarios/requirements. This means a few things to me.
> 
> - Drop some of the requirements and go with one implementation which meets
>   those reduced set of requirements.
>
> - Have more than one IO controller implementation in the kernel. One for lower
>   level control for better latencies, stronger isolation and optimal resource
>   usage and other one for fairness at higher level logical devices and max
>   bandwidth control. 
> 
>   And let user decide which one to use based on his/her needs. 
> 
> - Come up with more intelligent way of doing IO control where single
>   controller covers all the cases.
> 
> At this point of time, I am more inclined towards option 2 of having more
> than one implementation in the kernel. :-) (Until and unless we can brainstorm
> and come up with ideas to make option 3 happen).
>
> > It would be great if we discuss our plans on the mailing list, so we
> > can get early feedback from everyone.
>  
> This is what comes to my mind so far. Please add to the list if I have missed
> some points. Also correct me if I am wrong about the pros/cons of the
> approaches.
>
> Thoughts/ideas/opinions are welcome...
> 
> Thanks
> Vivek

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                 ` <1254164034.9820.81.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-09-29  7:14                   ` Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-09-29  7:14 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Mike,
On Mon, Sep 28, 2009 at 8:53 PM, Mike Galbraith <efault@gmx.de> wrote:
> On Mon, 2009-09-28 at 14:18 -0400, Vivek Goyal wrote:
>> On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:
>
>> I guess changing class to IDLE should have helped a bit as now this is
>> equivalent to setting the quantum to 1 and after dispatching one request
>> to disk, CFQ will always expire the writer once. So it might happen that
>> by the time the reader preempted the writer, we have fewer requests in the
>> disk and lower latency for this reader.
>
> I expected SCHED_IDLE to be better than setting quantum to 1, because
> max is quantum*4 if you aren't IDLE.  But that's not what happened.  I
> just retested with all knobs set back to stock, fairness off, and
> quantum set to 1 with everything running nice 0.  2.8 seconds avg :-/

Idle doesn't work very well for async writes, since the writer process
will just send its writes to the page cache.
The real writeback will happen in the context of a kernel thread, with
the best-effort scheduling class.
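The effect Corrado describes can be sketched in a few lines of Python (purely illustrative; `buffered_write_then_sync` is a hypothetical helper name, not anything from the kernel or the patches under discussion): a buffered write() completes as soon as the data reaches the page cache, so the writer's I/O scheduling class never governs the real device I/O unless the process itself forces it, e.g. with fsync():

```python
import os
import tempfile

def buffered_write_then_sync(data: bytes) -> int:
    """Buffered write followed by an explicit sync; returns the file size.

    The os.write() call returns once the data is copied into the page
    cache; an ionice'd (idle-class) writer gets no penalty for it.  Only
    os.fsync() forces the device I/O in this process's own I/O context --
    otherwise the writeback happens later, in kernel-thread context.
    """
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, data)   # lands in the page cache, returns quickly
        os.fsync(fd)         # now the data is actually pushed to the device
        return os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.remove(path)
```

This is why setting the writer to SCHED_IDLE (or ionice class idle) barely helps here: the dd process is charged almost nothing, and the flusher thread doing the real writes runs best-effort.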

>
>> > I saw
>> > the reference to Vivek's patch, and gave it a shot.  Makes a large
>> > difference.
>> >                                                            Avg
>> > perf stat     12.82     7.19     8.49     5.76     9.32    8.7     anticipatory
>> >               16.24   175.82   154.38   228.97   147.16  144.5     noop
>> >               43.23    57.39    96.13   148.25   180.09  105.0     deadline
>> >                9.15    14.51     9.39    15.06     9.90   11.6     cfq fairness=0 dd=nice 0
>> >               12.22     9.85    12.55     9.88    15.06   11.9     cfq fairness=0 dd=nice 19
>> >                9.77    13.19    11.78    17.40     9.51   11.9     cfq fairness=0 dd=SCHED_IDLE
>> >                4.59     2.74     4.70     3.45     4.69    4.0     cfq fairness=1 dd=nice 0
>> >                3.79     4.66     2.66     5.15     3.03    3.8     cfq fairness=1 dd=nice 19
>> >                2.79     4.73     2.79     4.02     2.50    3.3     cfq fairness=1 dd=SCHED_IDLE
>> >
>>
>> Hmm.., looks like average latency went down only in case of fairness=1
>> and not in case of fairness=0. (Looking at the previous mail, average vanilla
>> cfq latencies were around 12 seconds).
>
> Yup.
>
>> Are you running all this in root group or have you put writers and readers
>> into separate cgroups?
>
> No cgroups here.
>
>> If everything is running in the root group, then I am curious why latency went
>> down in case of fairness=1. The only thing the fairness=1 parameter does is
>> that it lets all the requests from the previous queue complete before starting
>> to dispatch from the next queue. On top of this, it is valid only if no preemption
>> took place. In your test case, konsole should preempt the writer, so
>> practically fairness=1 might not make much difference.
>
> fairness=1 very definitely makes a very large difference.  All of those
> cfq numbers were logged in back to back runs.
>
>> In fact Jens has now committed a patch which achieves a similar effect to
>> fairness=1 for async queues.
>
> Yeah, I was there yesterday.  I speculated that that would hurt my
> reader, but rearranging things didn't help one bit.  Playing with merge,
> I managed to give dd ~7% more throughput, and injured poor reader even
> more.  (problem analysis via hammer/axe not always most effective;)
>
>> commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
>> Author: Jens Axboe <jens.axboe@oracle.com>
>> Date:   Fri Jul 3 12:57:48 2009 +0200
>>
>>     cfq-iosched: drain device queue before switching to a sync queue
>>
>>     To lessen the impact of async IO on sync IO, let the device drain of
>>     any async IO in progress when switching to a sync cfqq that has idling
>>     enabled.
>>
>>
>> If everything is in separate cgroups, then we should have seen latency
>> improvements in the fairness=0 case also. I am a little perplexed here..
>>
>> Thanks
>> Vivek
>
>

Thanks,
Corrado


-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-28 18:53               ` Mike Galbraith
@ 2009-09-29  7:14                 ` Corrado Zoccolo
       [not found]                 ` <1254164034.9820.81.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-09-29  7:14 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe,
	Tobias Oetiker

Hi Mike,
On Mon, Sep 28, 2009 at 8:53 PM, Mike Galbraith <efault@gmx.de> wrote:
> On Mon, 2009-09-28 at 14:18 -0400, Vivek Goyal wrote:
>> On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:
>
>> I guess changing class to IDLE should have helped a bit as now this is
>> equivalent to setting the quantum to 1 and after dispatching one request
>> to disk, CFQ will always expire the writer once. So it might happen that
>> by the time the reader preempted the writer, we have fewer requests in the
>> disk and lower latency for this reader.
>
> I expected SCHED_IDLE to be better than setting quantum to 1, because
> max is quantum*4 if you aren't IDLE.  But that's not what happened.  I
> just retested with all knobs set back to stock, fairness off, and
> quantum set to 1 with everything running nice 0.  2.8 seconds avg :-/

Idle doesn't work very well for async writes, since the writer process
will just send its writes to the page cache.
The real writeback will happen in the context of a kernel thread, with
the best-effort scheduling class.

>
>> > I saw
>> > the reference to Vivek's patch, and gave it a shot.  Makes a large
>> > difference.
>> >                                                            Avg
>> > perf stat     12.82     7.19     8.49     5.76     9.32    8.7     anticipatory
>> >               16.24   175.82   154.38   228.97   147.16  144.5     noop
>> >               43.23    57.39    96.13   148.25   180.09  105.0     deadline
>> >                9.15    14.51     9.39    15.06     9.90   11.6     cfq fairness=0 dd=nice 0
>> >               12.22     9.85    12.55     9.88    15.06   11.9     cfq fairness=0 dd=nice 19
>> >                9.77    13.19    11.78    17.40     9.51   11.9     cfq fairness=0 dd=SCHED_IDLE
>> >                4.59     2.74     4.70     3.45     4.69    4.0     cfq fairness=1 dd=nice 0
>> >                3.79     4.66     2.66     5.15     3.03    3.8     cfq fairness=1 dd=nice 19
>> >                2.79     4.73     2.79     4.02     2.50    3.3     cfq fairness=1 dd=SCHED_IDLE
>> >
>>
>> Hmm.., looks like average latency went down only in case of fairness=1
>> and not in case of fairness=0. (Looking at the previous mail, average vanilla
>> cfq latencies were around 12 seconds).
>
> Yup.
>
>> Are you running all this in root group or have you put writers and readers
>> into separate cgroups?
>
> No cgroups here.
>
>> If everything is running in the root group, then I am curious why latency went
>> down in case of fairness=1. The only thing the fairness=1 parameter does is
>> that it lets all the requests from the previous queue complete before starting
>> to dispatch from the next queue. On top of this, it is valid only if no preemption
>> took place. In your test case, konsole should preempt the writer, so
>> practically fairness=1 might not make much difference.
>
> fairness=1 very definitely makes a very large difference.  All of those
> cfq numbers were logged in back to back runs.
>
>> In fact Jens has now committed a patch which achieves a similar effect to
>> fairness=1 for async queues.
>
> Yeah, I was there yesterday.  I speculated that that would hurt my
> reader, but rearranging things didn't help one bit.  Playing with merge,
> I managed to give dd ~7% more throughput, and injured poor reader even
> more.  (problem analysis via hammer/axe not always most effective;)
>
>> commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
>> Author: Jens Axboe <jens.axboe@oracle.com>
>> Date:   Fri Jul 3 12:57:48 2009 +0200
>>
>>     cfq-iosched: drain device queue before switching to a sync queue
>>
>>     To lessen the impact of async IO on sync IO, let the device drain of
>>     any async IO in progress when switching to a sync cfqq that has idling
>>     enabled.
>>
>>
>> If everything is in separate cgroups, then we should have seen latency
>> improvements in the fairness=0 case also. I am a little perplexed here..
>>
>> Thanks
>> Vivek
>
>

Thanks,
Corrado


-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]             ` <20090928171420.GA3643-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-29  7:10               ` Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-09-29  7:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,
On Mon, Sep 28, 2009 at 7:14 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Sep 28, 2009 at 05:35:02PM +0200, Corrado Zoccolo wrote:
>> On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
>> >> Hi Vivek,
>> >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> >> >> Vivek Goyal wrote:
>> >> >> > Notes:
>> >> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >> >> >   Bring down its throughput and bump up latencies significantly.
>> >> >>
>> >> >>
>> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> >> >> too.
>> >> >>
>> >> >> I'm basing this assumption on the observations I made on both OpenSuse
>> >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> >> >> titled: "Poor desktop responsiveness with background I/O-operations" of
>> >> >> 2009-09-20.
>> >> >> (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
>> >> >>
>> >> >>
>> >> >> Thus, I'm posting this to show that your work is greatly appreciated,
>> >> >> given the rather disappointing status quo of Linux's fairness when it
>> >> >> comes to disk IO time.
>> >> >>
>> >> >> I hope that your efforts lead to a change in performance of current
>> >> >> userland applications, the sooner, the better.
>> >> >>
>> >> > [Please don't remove people from original CC list. I am putting them back.]
>> >> >
>> >> > Hi Ulrich,
>> >> >
>> >> > I quickly went through that mail thread and tried the following on my
>> >> > desktop.
>> >> >
>> >> > ##########################################
>> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> >> > sleep 5
>> >> > time firefox
>> >> > # close firefox once gui pops up.
>> >> > ##########################################
>> >> >
>> >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
>> >> > following.
>> >> >
>> >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>> >> >
>> >> > (Results do vary across runs, especially if system is booted fresh. Don't
>> >> >  know why...).
>> >> >
>> >> >
>> >> > Then I tried putting both the applications in separate groups and assign
>> >> > them weights 200 each.
>> >> >
>> >> > ##########################################
>> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> >> > echo $! > /cgroup/io/test1/tasks
>> >> > sleep 5
>> >> > echo $$ > /cgroup/io/test2/tasks
>> >> > time firefox
>> >> > # close firefox once gui pops up.
>> >> > ##########################################
>> >> >
>> >> > Now firefox pops up in 27 seconds. So it cut down the time by 2/3.
>> >> >
>> >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>> >> >
>> >> > Notice that throughput of dd also improved.
>> >> >
>> >> > I ran the block trace and noticed that in many cases firefox threads
>> >> > immediately preempted the "dd". Probably because it was a file system
>> >> > request. So in this case latency will arise from seek time.
>> >> >
>> >> > In some other cases, threads had to wait for up to 100ms because dd was
>> >> > not preempted. In this case latency will arise both from waiting on queue
>> >> > as well as seek time.
>> >>
>> >> I think cfq should already be doing something similar, i.e. giving
>> >> 100ms slices to firefox, that alternate with dd, unless:
>> >> * firefox is too seeky (in this case, the idle window will be too small)
>> >> * firefox has too much think time.
>> >>
>> >
>> Hi Vivek,
>> > Hi Corrado,
>> >
>> > "firefox" is the shell script that sets up the environment and launches the
>> > browser. It seems to be a group of threads. Some of them run in parallel
>> > and some of them seem to run one after the other (once the previous
>> > process or thread has finished).
>>
>> Ok.
>>
>> >
>> >> To rule out the first case, what happens if you run the test with your
>> >> "fairness for seeky processes" patch?
>> >
>> > I applied that patch and it helps a lot.
>> >
>> > http://lwn.net/Articles/341032/
>> >
>> > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.
>>
>> Great.
>> Can you try the attached patch (on top of 2.6.31)?
>> It implements the alternative approach we discussed privately in July,
>> and it addresses the possible latency increase that could happen with
>> your patch.
>>
>> To summarize for everyone, we separate sync sequential queues, sync
>> seeky queues and async queues into three separate RR structures, and
>> alternate servicing requests between them.
>>
>> When servicing seeky queues (the ones that are usually penalized by
>> cfq, for which no fairness is usually provided), we do not idle
>> between them, but we do idle for the last queue (the idle can be
>> exited when any seeky queue has requests). This allows us to allocate
>> disk time globally for all seeky processes, and to reduce seeky
>> processes latencies.
>>
>
> Ok, I seem to be doing same thing at group level (In group scheduling
> patches). I do not idle on individual sync seeky queues but if this is
> last queue in the group, then I do idle to make sure the group does not lose
> its fair share, and exit from idle the moment there is any busy queue in
> the group.
>
> So you seem to be grouping all the sync seeky queues system wide in a
> single group. So all the sync seeky queues collectively get 100ms in a
> single round of dispatch?

A round of dispatch (defined by tunable target_latency, default 300ms)
is subdivided between the three groups, proportionally to how many
queues are waiting in each, so if we have 1 sequential and 2 seeky
(and 0 async), we get 100ms for seq and 200ms for seeky.
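That subdivision can be sketched with simple proportional integer division (an illustrative model of the behaviour described above, not the patch's actual code, which may round or bound the shares differently):

```python
def split_round(target_latency_ms: int, n_seq: int, n_seeky: int, n_async: int):
    """Split one dispatch round among the three service trees,
    proportionally to how many queues are waiting in each."""
    total = n_seq + n_seeky + n_async
    if total == 0:
        return (0, 0, 0)
    return tuple(target_latency_ms * n // total
                 for n in (n_seq, n_seeky, n_async))

# 1 sequential + 2 seeky queues, default 300ms target latency
print(split_round(300, 1, 2, 0))
```

With 1 sequential and 2 seeky queues this yields the 100ms/200ms split from the example.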

> I am wondering what happens if there are a lot
> of such sync seeky queues and this 100ms time slice is consumed before all the
> sync seeky queues get a chance to dispatch. Does that mean that some of
> the queues can completely skip one dispatch round?
It can happen: if each seek costs 10ms, and you have more than 30
seeky processes, then you are guaranteed that they cannot all issue in
the same round.
When this happens, the ones that did not issue before will be the
first ones to be issued in the next round.
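The carry-over behaviour can be sketched as follows (a simplification: it assumes the whole 300ms round goes to the seeky group and every seek costs a flat 10ms, so at most 30 queues fit per round):

```python
def dispatch_rounds(num_queues: int, slice_ms: int = 300, seek_cost_ms: int = 10):
    """Group seeky queue ids into successive dispatch rounds.

    Queues that did not fit into one round are carried over and
    served first in the next round, so none is starved indefinitely.
    """
    per_round = max(1, slice_ms // seek_cost_ms)  # 30 with the defaults
    rounds, pending = [], list(range(num_queues))
    while pending:
        rounds.append(pending[:per_round])  # these issue this round
        pending = pending[per_round:]       # carried over, served first next round
    return rounds

# 35 seeky queues: 30 issue in round 1, the remaining 5 lead round 2
print(len(dispatch_rounds(35)))
```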

Thanks,
Corrado

>
> Thanks
> Vivek
>
>> I tested with 'konsole -e exit', while doing a sequential write with
>> dd, and the start up time reduced from 37s to 7s, on an old laptop
>> disk.
>>
>> Thanks,
>> Corrado
>>
>> >
>> >> To rule out the first case, what happens if you run the test with your
>> >> "fairness for seeky processes" patch?
>> >
>> > I applied that patch and it helps a lot.
>> >
>> > http://lwn.net/Articles/341032/
>> >
>> > With above patchset applied, and fairness=1, firefox pops up in 27-28
>> > seconds.
>> >
>> > So it looks like if we don't disable idle window for seeky processes on
>> > hardware supporting command queuing, it helps in this particular case.
>> >
>> > Thanks
>> > Vivek
>> >
>
>
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-28 17:14             ` Vivek Goyal
  (?)
@ 2009-09-29  7:10             ` Corrado Zoccolo
  -1 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-09-29  7:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, mingo, riel, jens.axboe, Tobias Oetiker

Hi Vivek,
On Mon, Sep 28, 2009 at 7:14 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Sep 28, 2009 at 05:35:02PM +0200, Corrado Zoccolo wrote:
>> On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
>> >> Hi Vivek,
>> >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> >> >> Vivek Goyal wrote:
>> >> >> > Notes:
>> >> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >> >> >   Bring down its throughput and bump up latencies significantly.
>> >> >>
>> >> >>
>> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> >> >> too.
>> >> >>
>> >> >> I'm basing this assumption on the observations I made on both OpenSuse
>> >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> >> >> titled: "Poor desktop responsiveness with background I/O-operations" of
>> >> >> 2009-09-20.
>> >> >> (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
>> >> >>
>> >> >>
>> >> >> Thus, I'm posting this to show that your work is greatly appreciated,
>> >> >> given the rather disappointing status quo of Linux's fairness when it
>> >> >> comes to disk IO time.
>> >> >>
>> >> >> I hope that your efforts lead to a change in performance of current
>> >> >> userland applications, the sooner, the better.
>> >> >>
>> >> > [Please don't remove people from original CC list. I am putting them back.]
>> >> >
>> >> > Hi Ulrich,
>> >> >
>> >> > I quickly went through that mail thread and tried the following on my
>> >> > desktop.
>> >> >
>> >> > ##########################################
>> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> >> > sleep 5
>> >> > time firefox
>> >> > # close firefox once gui pops up.
>> >> > ##########################################
>> >> >
>> >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
>> >> > following.
>> >> >
>> >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>> >> >
>> >> > (Results do vary across runs, especially if system is booted fresh. Don't
>> >> >  know why...).
>> >> >
>> >> >
>> >> > Then I tried putting both the applications in separate groups and assign
>> >> > them weights 200 each.
>> >> >
>> >> > ##########################################
>> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> >> > echo $! > /cgroup/io/test1/tasks
>> >> > sleep 5
>> >> > echo $$ > /cgroup/io/test2/tasks
>> >> > time firefox
>> >> > # close firefox once gui pops up.
>> >> > ##########################################
>> >> >
>> >> > Now firefox pops up in 27 seconds. So it cut down the time by 2/3.
>> >> >
>> >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>> >> >
>> >> > Notice that throughput of dd also improved.
>> >> >
>> >> > I ran the block trace and noticed that in many cases firefox threads
>> >> > immediately preempted the "dd". Probably because it was a file system
>> >> > request. So in this case latency will arise from seek time.
>> >> >
>> >> > In some other cases, threads had to wait for up to 100ms because dd was
>> >> > not preempted. In this case latency will arise both from waiting on queue
>> >> > as well as seek time.
>> >>
>> >> I think cfq should already be doing something similar, i.e. giving
>> >> 100ms slices to firefox, that alternate with dd, unless:
>> >> * firefox is too seeky (in this case, the idle window will be too small)
>> >> * firefox has too much think time.
>> >>
>> >
>> Hi Vivek,
>> > Hi Corrado,
>> >
>> > "firefox" is the shell script that sets up the environment and launches the
>> > browser. It seems to be a group of threads. Some of them run in parallel
>> > and some of them seem to run one after the other (once the previous
>> > process or thread has finished).
>>
>> Ok.
>>
>> >
>> >> To rule out the first case, what happens if you run the test with your
>> >> "fairness for seeky processes" patch?
>> >
>> > I applied that patch and it helps a lot.
>> >
>> > http://lwn.net/Articles/341032/
>> >
>> > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.
>>
>> Great.
>> Can you try the attached patch (on top of 2.6.31)?
>> It implements the alternative approach we discussed privately in July,
>> and it addresses the possible latency increase that could happen with
>> your patch.
>>
>> To summarize for everyone, we separate sync sequential queues, sync
>> seeky queues and async queues into three separate RR structures, and
>> alternate servicing requests between them.
>>
>> When servicing seeky queues (the ones that are usually penalized by
>> cfq, for which no fairness is usually provided), we do not idle
>> between them, but we do idle for the last queue (the idle can be
>> exited when any seeky queue has requests). This allows us to allocate
>> disk time globally for all seeky processes, and to reduce seeky
>> processes latencies.
>>
>
> Ok, I seem to be doing same thing at group level (In group scheduling
> patches). I do not idle on individual sync seeky queues but if this is
> last queue in the group, then I do idle to make sure the group does not lose
> its fair share, and exit from idle the moment there is any busy queue in
> the group.
>
> So you seem to be grouping all the sync seeky queues system wide in a
> single group. So all the sync seeky queues collectively get 100ms in a
> single round of dispatch?

A round of dispatch (defined by tunable target_latency, default 300ms)
is subdivided between the three groups, proportionally to how many
queues are waiting in each, so if we have 1 sequential and 2 seeky
(and 0 async), we get 100ms for seq and 200ms for seeky.

> I am wondering what happens if there are a lot
> of such sync seeky queues and this 100ms time slice is consumed before all the
> sync seeky queues get a chance to dispatch. Does that mean that some of
> the queues can completely skip one dispatch round?
It can happen: if each seek costs 10ms, and you have more than 30
seeky processes, then you are guaranteed that they cannot all issue in
the same round.
When this happens, the ones that did not issue before will be the
first ones to be issued in the next round.

Thanks,
Corrado

>
> Thanks
> Vivek
>
>> I tested with 'konsole -e exit', while doing a sequential write with
>> dd, and the start up time reduced from 37s to 7s, on an old laptop
>> disk.
>>
>> Thanks,
>> Corrado
>>
>> >
>> >> To rule out the first case, what happens if you run the test with your
>> >> "fairness for seeky processes" patch?
>> >
>> > I applied that patch and it helps a lot.
>> >
>> > http://lwn.net/Articles/341032/
>> >
>> > With above patchset applied, and fairness=1, firefox pops up in 27-28
>> > seconds.
>> >
>> > So it looks like if we don't disable idle window for seeky processes on
>> > hardware supporting command queuing, it helps in this particular case.
>> >
>> > Thanks
>> > Vivek
>> >
>
>
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]             ` <1254160274.9820.25.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-09-28 18:18               ` Vivek Goyal
@ 2009-09-29  5:55               ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-29  5:55 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, 2009-09-28 at 19:51 +0200, Mike Galbraith wrote:

> I'll give your patch a spin as well.

I applied it to tip, and fixed up the rejects.  I haven't done a line-for-line
verification against the original patch yet (brave or..), so add a giant
economy-sized pinch of salt.

In the form it ended up in, it didn't help here.  I tried twiddling
knobs, but it didn't help either.  Reducing latency target from 300 to
30 did nada, but dropping to 3 did... I got to poke BRB.

Plugging Vivek's fairness tweakable on top, and enabling it, my timings
return to decent numbers, so that one liner absatively posilutely is
where my write vs read woes are coming from.

FWIW, below is patch wedged into tip v2.6.31-10215-ga3c9602

---
 block/cfq-iosched.c |  281 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 227 insertions(+), 54 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -27,6 +27,12 @@ static const int cfq_slice_sync = HZ / 1
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
+static int cfq_target_latency = HZ * 3/10; /* 300 ms */
+static int cfq_hist_divisor = 4;
+/*
+ * Number of times that other workloads can be scheduled before async
+ */
+static const unsigned int cfq_async_penalty = 4;
 
 /*
  * offset from end of service tree
@@ -36,7 +42,7 @@ static int cfq_slice_idle = HZ / 125;
 /*
  * below this threshold, we consider thinktime immediate
  */
-#define CFQ_MIN_TT		(2)
+#define CFQ_MIN_TT		(1)
 
 #define CFQ_SLICE_SCALE		(5)
 #define CFQ_HW_QUEUE_MIN	(5)
@@ -67,8 +73,9 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
 struct cfq_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
+	unsigned count;
 };
-#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, }
+#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, 0, }
 
 /*
  * Per process-grouping structure
@@ -113,6 +120,21 @@ struct cfq_queue {
 	unsigned short ioprio_class, org_ioprio_class;
 
 	pid_t pid;
+
+	struct cfq_rb_root *service_tree;
+	struct cfq_io_context *cic;
+};
+
+enum wl_prio_t {
+	IDLE_WL = -1,
+	BE_WL = 0,
+	RT_WL = 1
+};
+
+enum wl_type_t {
+	ASYNC_WL = 0,
+	SYNC_NOIDLE_WL = 1,
+	SYNC_WL = 2
 };
 
 /*
@@ -124,7 +146,13 @@ struct cfq_data {
 	/*
 	 * rr list of queues with requests and the count of them
 	 */
-	struct cfq_rb_root service_tree;
+	struct cfq_rb_root service_trees[2][3];
+	struct cfq_rb_root service_tree_idle;
+
+	enum wl_prio_t serving_prio;
+	enum wl_type_t serving_type;
+	unsigned long workload_expires;
+	unsigned int async_starved;
 
 	/*
 	 * Each priority tree is sorted by next_request position.  These
@@ -134,9 +162,11 @@ struct cfq_data {
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
 	unsigned int busy_queues;
+	unsigned int busy_queues_avg[2];
 
 	int rq_in_driver[2];
 	int sync_flight;
+	int reads_delayed;
 
 	/*
 	 * queue-depth detection
@@ -173,6 +203,9 @@ struct cfq_data {
 	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
+	unsigned int cfq_target_latency;
+	unsigned int cfq_hist_divisor;
+	unsigned int cfq_async_penalty;
 
 	struct list_head cic_list;
 
@@ -182,6 +215,11 @@ struct cfq_data {
 	struct cfq_queue oom_cfqq;
 };
 
+static struct cfq_rb_root * service_tree_for(enum wl_prio_t prio, enum wl_type_t type,
+							  struct cfq_data *cfqd) {
+	return prio == IDLE_WL ? &cfqd->service_tree_idle :  &cfqd->service_trees[prio][type];
+}
+
 enum cfqq_state_flags {
 	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
 	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
@@ -226,6 +264,17 @@ CFQ_CFQQ_FNS(coop);
 #define cfq_log(cfqd, fmt, args...)	\
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
+#define CIC_SEEK_THR	1024
+#define CIC_SEEKY(cic)	((cic)->seek_mean > CIC_SEEK_THR)
+#define CFQQ_SEEKY(cfqq) (!cfqq->cic || CIC_SEEKY(cfqq->cic))
+
+static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd) {
+	return wl==IDLE_WL? cfqd->service_tree_idle.count :
+		cfqd->service_trees[wl][ASYNC_WL].count
+		+ cfqd->service_trees[wl][SYNC_NOIDLE_WL].count
+		+ cfqd->service_trees[wl][SYNC_WL].count;
+}
+
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
 				       struct io_context *, gfp_t);
@@ -247,6 +296,7 @@ static inline void cic_set_cfqq(struct c
 				struct cfq_queue *cfqq, int is_sync)
 {
 	cic->cfqq[!!is_sync] = cfqq;
+	cfqq->cic = cic;
 }
 
 /*
@@ -301,10 +351,33 @@ cfq_prio_to_slice(struct cfq_data *cfqd,
 	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
 }
 
+static inline unsigned
+cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) {
+	unsigned min_q, max_q;
+	unsigned mult  = cfqd->cfq_hist_divisor - 1;
+	unsigned round = cfqd->cfq_hist_divisor / 2;
+	unsigned busy  = cfq_busy_queues_wl(rt, cfqd);
+	min_q = min(cfqd->busy_queues_avg[rt], busy);
+	max_q = max(cfqd->busy_queues_avg[rt], busy);
+	cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
+		cfqd->cfq_hist_divisor;
+	return cfqd->busy_queues_avg[rt];
+}
+
 static inline void
 cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+	unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1];
+	unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq));
+	unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
+
+	if (iq > process_thr) {
+		unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle
+			/ cfqd->cfq_slice[1];
+		slice = max(slice * process_thr / iq, min(slice, low_slice));
+	}
+
+	cfqq->slice_end = jiffies + slice;
 	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
 }
 
@@ -443,6 +516,7 @@ static void cfq_rb_erase(struct rb_node
 	if (root->left == n)
 		root->left = NULL;
 	rb_erase_init(n, &root->rb);
+	--root->count;
 }
 
 /*
@@ -483,46 +557,56 @@ static unsigned long cfq_slice_offset(st
 }
 
 /*
- * The cfqd->service_tree holds all pending cfq_queue's that have
+ * The cfqd->service_trees holds all pending cfq_queue's that have
  * requests waiting to be processed. It is sorted in the order that
  * we will service the queues.
  */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
+static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	struct rb_node **p, *parent;
 	struct cfq_queue *__cfqq;
 	unsigned long rb_key;
+	struct cfq_rb_root *service_tree;
 	int left;
 
 	if (cfq_class_idle(cfqq)) {
 		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
+		service_tree = &cfqd->service_tree_idle;
+		parent = rb_last(&service_tree->rb);
 		if (parent && parent != &cfqq->rb_node) {
 			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 			rb_key += __cfqq->rb_key;
 		} else
 			rb_key += jiffies;
-	} else if (!add_front) {
+	} else {
+		enum wl_prio_t prio = cfq_class_rt(cfqq) ? RT_WL : BE_WL;
+		enum wl_type_t type = cfq_cfqq_sync(cfqq) ? SYNC_WL : ASYNC_WL;
+
 		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
 		rb_key += cfqq->slice_resid;
 		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
+
+		if (type == SYNC_WL && (CFQQ_SEEKY(cfqq) || !cfq_cfqq_idle_window(cfqq)))
+			type = SYNC_NOIDLE_WL;
+
+		service_tree = service_tree_for(prio, type, cfqd);
+	}
 
 	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
 		/*
 		 * same position, nothing more to do
 		 */
-		if (rb_key == cfqq->rb_key)
+		if (rb_key == cfqq->rb_key && cfqq->service_tree == service_tree)
 			return;
 
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+		cfqq->service_tree = NULL;
 	}
 
 	left = 1;
 	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
+	cfqq->service_tree = service_tree;
+	p = &service_tree->rb.rb_node;
 	while (*p) {
 		struct rb_node **n;
 
@@ -554,11 +638,12 @@ static void cfq_service_tree_add(struct
 	}
 
 	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
+		service_tree->left = &cfqq->rb_node;
 
 	cfqq->rb_key = rb_key;
 	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
+	rb_insert_color(&cfqq->rb_node, &service_tree->rb);
+	service_tree->count++;
 }
 
 static struct cfq_queue *
@@ -631,7 +716,7 @@ static void cfq_resort_rr_list(struct cf
 	 * Resorting requires the cfqq to be on the RR list already.
 	 */
 	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
+		cfq_service_tree_add(cfqd, cfqq);
 		cfq_prio_tree_add(cfqd, cfqq);
 	}
 }
@@ -660,8 +745,10 @@ static void cfq_del_cfqq_rr(struct cfq_d
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));
 	cfq_clear_cfqq_on_rr(cfqq);
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+		cfqq->service_tree = NULL;
+	}
 	if (cfqq->p_root) {
 		rb_erase(&cfqq->p_node, cfqq->p_root);
 		cfqq->p_root = NULL;
@@ -923,10 +1010,11 @@ static inline void cfq_slice_expired(str
  */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
+	struct cfq_rb_root *service_tree = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd);
 
-	return cfq_rb_first(&cfqd->service_tree);
+	if (RB_EMPTY_ROOT(&service_tree->rb))
+		return NULL;
+	return cfq_rb_first(service_tree);
 }
 
 /*
@@ -954,9 +1042,6 @@ static inline sector_t cfq_dist_from_las
 		return cfqd->last_position - blk_rq_pos(rq);
 }
 
-#define CIC_SEEK_THR	8 * 1024
-#define CIC_SEEKY(cic)	((cic)->seek_mean > CIC_SEEK_THR)
-
 static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq)
 {
 	struct cfq_io_context *cic = cfqd->active_cic;
@@ -1044,6 +1129,10 @@ static struct cfq_queue *cfq_close_coope
 	if (cfq_cfqq_coop(cfqq))
 		return NULL;
 
+	/* we don't want to mix processes with different characteristics */
+	if (cfqq->service_tree != cur_cfqq->service_tree)
+		return NULL;
+
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
 	return cfqq;
@@ -1087,14 +1176,15 @@ static void cfq_arm_slice_timer(struct c
 
 	cfq_mark_cfqq_wait_request(cfqq);
 
-	/*
-	 * we don't want to idle for seeks, but we do want to allow
-	 * fair distribution of slice time for a process doing back-to-back
-	 * seeks. so allow a little bit of time for him to submit a new rq
-	 */
-	sl = cfqd->cfq_slice_idle;
-	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
+	sl = min_t(unsigned, cfqd->cfq_slice_idle, cfqq->slice_end - jiffies);
+
+	/* very small idle if we are serving noidle trees, and there are more trees */
+	if (cfqd->serving_type == SYNC_NOIDLE_WL &&
+	    service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count > 0) {
+		if (blk_queue_nonrot(cfqd->queue))
+			return;
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
+	}
 
 	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
@@ -1110,6 +1200,11 @@ static void cfq_dispatch_insert(struct r
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
 
+	if (!time_before(jiffies, rq->start_time + cfqd->cfq_target_latency / 2) && rq_data_dir(rq)==READ) {
+		cfqd->reads_delayed = max_t(int, cfqd->reads_delayed,
+					    (jiffies - rq->start_time) / (cfqd->cfq_target_latency / 2));
+	}
+
 	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
 	cfqq->dispatched++;
@@ -1156,6 +1251,16 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd,
 	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
 }
 
+enum wl_type_t cfq_choose_sync_async(struct cfq_data *cfqd, enum wl_prio_t prio) {
+	struct cfq_queue *id, *ni;
+	ni = cfq_rb_first(service_tree_for(prio, SYNC_NOIDLE_WL, cfqd));
+	id = cfq_rb_first(service_tree_for(prio, SYNC_WL, cfqd));
+	if (id && ni && id->rb_key < ni->rb_key)
+		return SYNC_WL;
+	if (!ni) return SYNC_WL;
+	return SYNC_NOIDLE_WL;
+}
+
 /*
  * Select a queue for service. If we have a current active queue,
  * check whether to continue servicing it, or retrieve and set a new one.
@@ -1196,15 +1301,68 @@ static struct cfq_queue *cfq_select_queu
 	 * flight or is idling for a new request, allow either of these
 	 * conditions to happen (or time out) before selecting a new queue.
 	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
+	if (timer_pending(&cfqd->idle_slice_timer) || 
 	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
 		cfqq = NULL;
 		goto keep_queue;
 	}
-
 expire:
 	cfq_slice_expired(cfqd, 0);
 new_queue:
+	if (!new_cfqq) {
+		enum wl_prio_t previous_prio = cfqd->serving_prio;
+
+		if (cfq_busy_queues_wl(RT_WL, cfqd))
+			cfqd->serving_prio = RT_WL;
+		else if (cfq_busy_queues_wl(BE_WL, cfqd))
+			cfqd->serving_prio = BE_WL;
+		else {
+			cfqd->serving_prio = IDLE_WL;
+			cfqd->workload_expires = jiffies + 1;
+			cfqd->reads_delayed = 0;
+		}
+
+		if (cfqd->serving_prio != IDLE_WL) {
+			int counts[]={
+				service_tree_for(cfqd->serving_prio, ASYNC_WL, cfqd)->count,
+				service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count,
+				service_tree_for(cfqd->serving_prio, SYNC_WL, cfqd)->count
+			};
+			int nonzero_counts= !!counts[0] + !!counts[1] + !!counts[2];
+
+			if (previous_prio != cfqd->serving_prio || (nonzero_counts == 1)) {
+				cfqd->serving_type = counts[1] ? SYNC_NOIDLE_WL : counts[2] ? SYNC_WL : ASYNC_WL;
+				cfqd->async_starved = 0;
+				cfqd->reads_delayed = 0;
+			} else {
+				if (!counts[cfqd->serving_type] || time_after(jiffies, cfqd->workload_expires)) {
+					if (cfqd->serving_type != ASYNC_WL && counts[ASYNC_WL] &&
+					    cfqd->async_starved++ > cfqd->cfq_async_penalty * (1 + cfqd->reads_delayed))
+						cfqd->serving_type = ASYNC_WL;
+					else 
+						cfqd->serving_type = cfq_choose_sync_async(cfqd, cfqd->serving_prio);
+				} else
+					goto same_wl;
+			}
+
+			{
+				unsigned slice = cfqd->cfq_target_latency;
+				slice = slice * counts[cfqd->serving_type] /
+					max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio],
+					      counts[SYNC_WL] + counts[SYNC_NOIDLE_WL] + counts[ASYNC_WL]);
+					    
+				if (cfqd->serving_type == ASYNC_WL)
+					slice = max(1U, (slice / (1 + cfqd->reads_delayed))
+						    * cfqd->cfq_slice[0] / cfqd->cfq_slice[1]);
+				else
+					slice = max(slice, 2U * max(1U, cfqd->cfq_slice_idle));
+
+				cfqd->workload_expires = jiffies + slice;
+				cfqd->async_starved *= (cfqd->serving_type != ASYNC_WL);
+			}
+		}
+	}
+ same_wl:
 	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
 keep_queue:
 	return cfqq;
@@ -1231,8 +1389,13 @@ static int cfq_forced_dispatch(struct cf
 {
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
+	int i,j;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j])) != NULL)
+				dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
 	cfq_slice_expired(cfqd, 0);
@@ -1300,6 +1463,12 @@ static int cfq_dispatch_requests(struct
 		return 0;
 
 	/*
+	 * Drain async requests before we start sync IO
+	 */
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+		return 0;
+
+	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
@@ -1993,18 +2162,8 @@ cfq_should_preempt(struct cfq_data *cfqd
 	if (cfq_class_idle(cfqq))
 		return 1;
 
-	/*
-	 * if the new request is sync, but the currently running queue is
-	 * not, let the sync request have priority.
-	 */
-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
-		return 1;
-
-	/*
-	 * So both queues are sync. Let the new request get disk time if
-	 * it's a metadata request and the current queue is doing regular IO.
-	 */
-	if (rq_is_meta(rq) && !cfqq->meta_pending)
+	if (cfqd->serving_type == SYNC_NOIDLE_WL
+	    && new_cfqq->service_tree == cfqq->service_tree)
 		return 1;
 
 	/*
@@ -2035,13 +2194,9 @@ static void cfq_preempt_queue(struct cfq
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
 	cfq_slice_expired(cfqd, 1);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));
 
-	cfq_service_tree_add(cfqd, cfqq, 1);
+	cfq_service_tree_add(cfqd, cfqq);
 
 	cfqq->slice_end = 0;
 	cfq_mark_cfqq_slice_new(cfqq);
@@ -2438,13 +2593,16 @@ static void cfq_exit_queue(struct elevat
 static void *cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
-	int i;
+	int i,j;
 
 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			cfqd->service_trees[i][j] = CFQ_RB_ROOT;
+	cfqd->service_tree_idle = CFQ_RB_ROOT;
 
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
@@ -2481,6 +2639,9 @@ static void *cfq_init_queue(struct reque
 	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
+	cfqd->cfq_target_latency = cfq_target_latency;
+	cfqd->cfq_hist_divisor = cfq_hist_divisor;
+	cfqd->cfq_async_penalty = cfq_async_penalty;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2517,6 +2678,7 @@ fail:
 /*
  * sysfs parts below -->
  */
+
 static ssize_t
 cfq_var_show(unsigned int var, char *page)
 {
@@ -2550,6 +2712,9 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd-
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
+SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
+SHOW_FUNCTION(cfq_hist_divisor_show, cfqd->cfq_hist_divisor, 0);
+SHOW_FUNCTION(cfq_async_penalty_show, cfqd->cfq_async_penalty, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2581,6 +2746,11 @@ STORE_FUNCTION(cfq_slice_sync_store, &cf
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
+
+STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 1000, 1);
+STORE_FUNCTION(cfq_hist_divisor_store, &cfqd->cfq_hist_divisor, 1, 100, 0);
+STORE_FUNCTION(cfq_async_penalty_store, &cfqd->cfq_async_penalty, 1, UINT_MAX, 0);
+
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2596,6 +2766,9 @@ static struct elv_fs_entry cfq_attrs[] =
 	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	CFQ_ATTR(target_latency),
+	CFQ_ATTR(hist_divisor),
+	CFQ_ATTR(async_penalty),
 	__ATTR_NULL
 };
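For readers skimming the diff: the busy_queues_avg[] bookkeeping the patch adds is a history-weighted estimate of how many queues are competing, updated in cfq_get_interested_queues().  A rough Python sketch of that update rule follows (hist_divisor defaults to 4 as in the patch; this is an illustration of the arithmetic, not the kernel code):

```python
def update_busy_avg(avg, busy, hist_divisor=4):
    # Weighted average of the previous estimate and the current busy
    # count: the larger of the two gets weight (hist_divisor - 1), so
    # the estimate tracks load spikes quickly but decays slowly.
    mult = hist_divisor - 1
    rnd = hist_divisor // 2          # round to nearest in integer math
    lo, hi = min(avg, busy), max(avg, busy)
    return (mult * hi + lo + rnd) // hist_divisor

avg = 0
for busy in (8, 8, 0, 0, 0):
    avg = update_busy_avg(avg, busy)   # climbs to 8 quickly, then decays 6, 5, 4
```

The asymmetry is the point: a burst of new queues is noticed immediately (so slices shrink right away), while a momentary dip doesn't instantly hand long slices back out.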
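Similarly, the reworked cfq_set_prio_slice() shrinks per-queue slices once more queues are competing than the latency target can cover at full slice length.  A hedged sketch of that scaling (times in ms rather than jiffies; the 300/100/8 defaults mirror cfq_target_latency, the sync cfq_slice, and cfq_slice_idle, but are illustrative values, not kernel constants):

```python
def scaled_slice(base_slice, iq, target_latency=300,
                 slice_sync=100, slice_idle=8):
    # How many queues fit in the latency target at a full sync slice.
    process_thr = target_latency // slice_sync
    if iq <= process_thr:
        return base_slice             # few competitors: keep the full slice
    # Floor: twice the idle window, scaled by this queue's slice length.
    low_slice = 2 * base_slice * slice_idle // slice_sync
    # Shrink proportionally with the queue count, but never below
    # min(base_slice, low_slice).
    return max(base_slice * process_thr // iq, min(base_slice, low_slice))
```

So with the defaults above, 10 interested queues cut a 100 ms slice to 30 ms, and the floor stops the slice from collapsing entirely under extreme queue counts.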


* Re: IO scheduler based IO controller V10
  2009-09-28 17:51           ` Mike Galbraith
  2009-09-28 18:18               ` Vivek Goyal
@ 2009-09-29  5:55             ` Mike Galbraith
       [not found]             ` <1254160274.9820.25.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-29  5:55 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe,
	Tobias Oetiker

-		return 1;
-
-	/*
-	 * So both queues are sync. Let the new request get disk time if
-	 * it's a metadata request and the current queue is doing regular IO.
-	 */
-	if (rq_is_meta(rq) && !cfqq->meta_pending)
+	if (cfqd->serving_type == SYNC_NOIDLE_WL
+	    && new_cfqq->service_tree == cfqq->service_tree)
 		return 1;
 
 	/*
@@ -2035,13 +2194,9 @@ static void cfq_preempt_queue(struct cfq
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
 	cfq_slice_expired(cfqd, 1);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));
 
-	cfq_service_tree_add(cfqd, cfqq, 1);
+	cfq_service_tree_add(cfqd, cfqq);
 
 	cfqq->slice_end = 0;
 	cfq_mark_cfqq_slice_new(cfqq);
@@ -2438,13 +2593,16 @@ static void cfq_exit_queue(struct elevat
 static void *cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
-	int i;
+	int i,j;
 
 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			cfqd->service_trees[i][j] = CFQ_RB_ROOT;
+	cfqd->service_tree_idle = CFQ_RB_ROOT;
 
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
@@ -2481,6 +2639,9 @@ static void *cfq_init_queue(struct reque
 	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
+	cfqd->cfq_target_latency = cfq_target_latency;
+	cfqd->cfq_hist_divisor = cfq_hist_divisor;
+	cfqd->cfq_async_penalty = cfq_async_penalty;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2517,6 +2678,7 @@ fail:
 /*
  * sysfs parts below -->
  */
+
 static ssize_t
 cfq_var_show(unsigned int var, char *page)
 {
@@ -2550,6 +2712,9 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd-
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
+SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
+SHOW_FUNCTION(cfq_hist_divisor_show, cfqd->cfq_hist_divisor, 0);
+SHOW_FUNCTION(cfq_async_penalty_show, cfqd->cfq_async_penalty, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2581,6 +2746,11 @@ STORE_FUNCTION(cfq_slice_sync_store, &cf
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
+
+STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 1000, 1);
+STORE_FUNCTION(cfq_hist_divisor_store, &cfqd->cfq_hist_divisor, 1, 100, 0);
+STORE_FUNCTION(cfq_async_penalty_store, &cfqd->cfq_async_penalty, 1, UINT_MAX, 0);
+
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2596,6 +2766,9 @@ static struct elv_fs_entry cfq_attrs[] =
 	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	CFQ_ATTR(target_latency),
+	CFQ_ATTR(hist_divisor),
+	CFQ_ATTR(async_penalty),
 	__ATTR_NULL
 };
 



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]   ` <e98e18940909281737q142c788dpd20b8bdc05dd0eff-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-09-29  3:22     ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-29  3:22 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> Hi Vivek,
> Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
> Jens about IO controller during Linux Plumbers Conference '09. Jens
> expressed his concerns about the size and complexity of the patches. I
> believe that is a reasonable concern. We talked about things that
> could be done to reduce the size of the patches. The requirement that
> the "solution has to work with all IO schedulers" seems like a
> secondary concern at this point; and it came out as one thing that can
> help to reduce the size of the patch set.

Initially doing cgroup based IO control only for CFQ should help a lot in
reducing the patchset size.

> Another possibility is to
> use a simpler scheduling algorithm e.g. weighted round robin, instead
> of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
> the fact that it is complex to understand, and might be cumbersome to
> maintain.

Core of the BFQ I have gotten rid of already. The remaining part is idle tree
and data structures. I will see how I can simplify it further.

> Also, hierarchical scheduling is something that could be
> unnecessary in the first set of patches, even though cgroups are
> hierarchical in nature.

Sure. Though I don't think that a lot of code is there because of
hierarchical nature. If we solve the issue at CFQ layer, we have to
maintain at least two levels: one for queues and the other for groups. So even
the simplest solution becomes almost hierarchical in nature. But I will
still see how to get rid of some code here too...
> 
> We are starting from a point where there is no cgroup based IO
> scheduling in the kernel. And it is probably not reasonable to satisfy
> all IO scheduling related requirements in one patch set. We can start
> with something simple, and build on top of that. So a very simple
> patch set that enables cgroup based proportional scheduling for CFQ
> seems like the way to go at this point.

Sure, we can start with CFQ only. But a bigger question we need to answer
is whether CFQ is the right place to solve the issue. Jens, do you think
that CFQ is the right place to solve the problem?

Andrew seems to favor a high level approach so that IO schedulers are less
complex and we can provide fairness at high level logical devices also. 

I will again try to summarize my understanding so far about the pros/cons
of each approach and then we can take the discussion forward.

Fairness in terms of size of IO or disk time used
=================================================
On seeky media, fairness in terms of disk time can get us better results
than fairness in terms of size of IO or number of IOs.

If we implement some kind of time based solution at a higher layer, then
that higher layer should know how much time each group used. We can
probably do some kind of timestamping in the bio to get a sense of when it
got into the disk and when it finished. But on multi-queue hardware there
can be multiple requests in the disk, either from the same queue or from
different queues, and with a pure timestamping based approach I could not
so far think of how, at a high level, we will get an idea of who used how
much time.

So this is the first point of contention: do we want to provide fairness
in terms of disk time used, or in terms of size/number of IOs?

Max bandwidth Controller or Proportional bandwidth controller
=============================================================
What is our primary requirement here? A weight based proportional
bandwidth controller where we can use the resources optimally and any
kind of throttling kicks in only if there is contention for the disk.

Or do we want max bandwidth control, where a group is not allowed to
exceed its limit even if the disk is otherwise free?

Or do we need both? I would think that at some point we will need both,
but we can start with proportional bandwidth control first.

Fairness for higher level logical devices
=========================================
Do we want good fairness numbers for higher level logical devices also,
or is it sufficient to provide fairness at leaf nodes? Providing fairness
at leaf nodes can help us use the resources optimally and in the process
we can get fairness at higher level also in many of the cases.

But do we want strict fairness numbers on higher level logical devices
even if it means sub-optimal usage of the underlying physical devices?

I think that for proportional bandwidth control, it should be ok to provide
fairness at higher level logical device but for max bandwidth control it
might make more sense to provide fairness at higher level. Consider a
case where from a striped device a customer wants to limit a group to
30MB/s and in case of leaf node control, if every leaf node provides
30MB/s, it might accumulate to much more than specified rate at logical
device.

Latency Control and strong isolation between groups
===================================================
Do we want better latencies and stronger isolation between groups?

I think if problem is solved at IO scheduler level, we can achieve better
latency control and hence stronger isolation between groups.

Higher level solutions should find it hard to provide the same kind of
latency control and isolation between groups as an IO scheduler based
solution.

Fairness for buffered writes
============================
Doing IO control anywhere below the page cache has the disadvantage that
the page cache might not dispatch more writes from the higher weight
group, hence the higher weight group might not see more IO done. Andrew
says that we don't have
a solution to this problem in kernel and he would like to see it handled
properly.

The only way to solve this seems to be to slow down the writers before
they write into the page cache. The IO throttling patch handled it by
slowing down the writer if it crossed the specified max rate. Other
suggestions have come in
the form of dirty_ratio per memory cgroup or a separate cgroup controller
al-together where some kind of per group write limit can be specified.

So if solution is implemented at IO scheduler layer or at device mapper
layer, both shall have to rely on another controller to be co-mounted
to handle buffered writes properly.

Fairness with-in group
======================
One of the issues with a higher level controller is how to do fair
throttling so that fairness within the group is not impacted, especially
making sure that we don't break the notion of ioprio of the processes
within the group.

The IO throttling patch in particular was very bad in terms of prio
within a group: throttling treated everyone equally and the difference
between process priorities disappeared.

Reads Vs Writes
===============
A higher level controller will most likely change the ratio in which
reads and writes are dispatched to the disk within a group. That used to
be decided by the IO scheduler, but with higher level groups doing
throttling, and possibly buffering the bios and releasing them later,
they will have to come up with their own policy on the proportion in
which reads and writes should be dispatched. In the case of IO scheduler
based control, all the queuing takes place at the IO scheduler and it
retains control of the ratio in which reads and writes are dispatched.


Summary
=======

- An io scheduler based io controller can provide better latencies,
  stronger isolation between groups, time based fairness and will not
  interfere with io schedulers policies like class, ioprio and
  reader vs writer issues.

  But it cannot guarantee fairness at higher level logical devices.
  Especially in the case of max bw control, leaf node control does not
  sound like the most appropriate thing.

- IO throttling provides max bw control in terms of absolute rate. It has
  the advantage that it can provide control at higher level logical device
  and also control buffered writes without need of additional controller
  co-mounted.

  But it does only max bw control and not proportional control, so one
  might not be using resources optimally. It loses the sense of task prio
  and class within a group, as any task in the group can be throttled.
  Because throttling does not kick in till you hit the max bw limit, it
  should find it hard to provide the same latencies as IO scheduler based
  control.

- dm-ioband also has the advantage that it can provide fairness at higher
  level logical devices.

  But, fairness is provided only in terms of size of IO or number of IOs;
  no time based fairness. It is very throughput oriented and does not
  throttle a high speed group if another group is running a slow random
  reader. This results in bad latencies for the random reader group and
  weaker isolation between groups.

  Also it does not provide fairness if a group is not continuously
  backlogged. So if one is running 1-2 dd/sequential readers in the group,
  one does not get fairness until workload is increased to a point where
  group becomes continuously backlogged. This also results in poor
  latencies and limited fairness.

At this point it does not look like a single IO controller can cover all
the scenarios/requirements. This means a few things to me.

- Drop some of the requirements and go with one implementation which meets
  those reduced set of requirements.

- Have more than one IO controller implementation in the kernel: one
  doing lower level control for better latencies, stronger isolation and
  optimal resource usage, and the other providing fairness at higher
  level logical devices and max bandwidth control.

  And let the user decide which one to use based on his/her needs.

- Come up with more intelligent way of doing IO control where single
  controller covers all the cases.

At this point, I am more inclined towards option 2 of having more than
one implementation in the kernel. :-) (Until and unless we can brainstorm
and come up with ideas to make option 3 happen.)

> 
> It would be great if we discuss our plans on the mailing list, so we
> can get early feedback from everyone.
 
This is what comes to my mind so far. Please add to the list if I have missed
some points. Also correct me if I am wrong about the pros/cons of the
approaches.

Thoughts/ideas/opinions are welcome...

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
@ 2009-09-29  3:22     ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-29  3:22 UTC (permalink / raw)
  To: Nauman Rafique
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew,
	yoshikawa.takuya, jmoyer, mingo, m-ikeda, riel, lizf, fchecconi,
	s-uchida, containers, linux-kernel, akpm, righi.andrea, torvalds

On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> Hi Vivek,
> Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
> Jens about IO controller during Linux Plumbers Conference '09. Jens
> expressed his concerns about the size and complexity of the patches. I
> believe that is a reasonable concern. We talked about things that
> could be done to reduce the size of the patches. The requirement that
> the "solution has to work with all IO schedulers" seems like a
> secondary concern at this point; and it came out as one thing that can
> help to reduce the size of the patch set.

Initially doing cgroup based IO control only for CFQ should help a lot in
reducing the patchset size.

> Another possibility is to
> use a simpler scheduling algorithm e.g. weighted round robin, instead
> of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
> the fact that it is complex to understand, and might be cumbersome to
> maintain.

Core of the BFQ I have gotten rid of already. The remaining part is idle tree
and data structures. I will see how can I simplify it further.

> Also, hierarchical scheduling is something that could be
> unnecessary in the first set of patches, even though cgroups are
> hierarchical in nature.

Sure. Though I don't think that a lot of code is there because of
hierarchical nature. If we solve the issue at CFQ layer, we have to
maintain atleast two levels. One for queue and other for groups. So even
the simplest solution becomes almost hierarchical in nature. But I will
still see how to get rid of some code here too...
> 
> We are starting from a point where there is no cgroup based IO
> scheduling in the kernel. And it is probably not reasonable to satisfy
> all IO scheduling related requirements in one patch set. We can start
> with something simple, and build on top of that. So a very simple
> patch set that enables cgroup based proportional scheduling for CFQ
> seems like the way to go at this point.

Sure, we can start with CFQ only. But a bigger question we need to answer
is that is CFQ the right place to solve the issue? Jens, do you think 
that CFQ is the right place to solve the problem?

Andrew seems to favor a high level approach so that IO schedulers are less
complex and we can provide fairness at high level logical devices also. 

I will again try to summarize my understanding so far about the pros/cons
of each approach and then we can take the discussion forward.

Fairness in terms of size of IO or disk time used
=================================================
On a seeky media, fairness in terms of disk time can get us better results
instead fairness interms of size of IO or number of IO.

If we implement some kind of time based solution at higher layer, then 
that higher layer should know who used how much of time each group used. We
can probably do some kind of timestamping in bio to get a sense when did it
get into disk and when did it finish. But on a multi queue hardware there
can be multiple requests in the disk either from same queue or from differnet
queues and with pure timestamping based apparoch, so far I could not think
how at high level we will get an idea who used how much of time.

So this is the first point of contention that how do we want to provide
fairness. In terms of disk time used or in terms of size of IO/number of
IO.

Max bandwidth Controller or Proportional bandwidth controller
=============================================================
What is our primary requirement here? A weight based proportional
bandwidth controller where we can use the resources optimally and any
kind of throttling kicks in only if there is contention for the disk.

Or we want max bandwidth control where a group is not allowed to use the
disk even if disk is free. 

Or we need both? I would think that at some point of time we will need
both but we can start with proportional bandwidth control first.

Fairness for higher level logical devices
=========================================
Do we want good fairness numbers for higher level logical devices also,
or is it sufficient to provide fairness at the leaf nodes? Providing
fairness at the leaf nodes lets us use the resources optimally, and in the
process we get fairness at the higher level as well in many cases.

But do we want strict fairness numbers on higher level logical devices
even if it means sub-optimal usage of the underlying physical devices?

I think that for proportional bandwidth control it should be OK to provide
fairness at the leaf nodes, but for max bandwidth control it might make
more sense to provide control at the higher level logical device. Consider
a case where, on a striped device, a customer wants to limit a group to
30MB/s: with leaf node control, if every leaf node provides 30MB/s, it can
accumulate to much more than the specified rate at the logical device.
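The striped-device arithmetic above, spelled out (hypothetical 4-way
stripe; the numbers are only illustrative):

```python
stripe_width = 4        # hypothetical 4-disk stripe under one logical device
per_leaf_limit = 30.0   # MB/s cap enforced independently at each leaf

# A well-striped workload can hit the cap on every leaf at once, so the
# rate observed at the logical device is the sum of the leaf caps:
logical_rate = stripe_width * per_leaf_limit
print(logical_rate)     # 120.0 MB/s, 4x the 30 MB/s the customer asked for
```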

Latency Control and strong isolation between groups
===================================================
Do we want good latency control and strong isolation between groups?

I think if the problem is solved at the IO scheduler level, we can achieve
better latency control and hence stronger isolation between groups.

Higher level solutions will find it hard to provide the same kind of
latency control and isolation between groups as an IO scheduler based
solution.

Fairness for buffered writes
============================
Doing IO control at any place below the page cache has the disadvantage
that the page cache might not dispatch more writes from a higher weight
group, hence the higher weight group might not see more IO done. Andrew
says that we don't have a solution to this problem in the kernel and he
would like to see it handled properly.

The only way to solve this seems to be to slow down the writers before they
write into the page cache. The IO throttling patch handled it by slowing
down the writer if it crossed the max specified rate. Other suggestions
have come in the form of a dirty_ratio per memory cgroup, or a separate
cgroup controller altogether where some kind of per group write limit can
be specified.

So whether the solution is implemented at the IO scheduler layer or at the
device mapper layer, both will have to rely on another co-mounted
controller to handle buffered writes properly.
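The "slow the writer down before it dirties the page cache" idea can be
sketched as a token bucket (a generic sketch only; the actual IO throttling
patch differs in detail, and the class/method names here are made up):

```python
import time

class BufferedWriteThrottle:
    """Toy token bucket: a group's writers sleep *before* copying data
    into the page cache, so the group's long-run dirtying rate stays at
    or below max_rate bytes/sec."""

    def __init__(self, max_rate):
        self.max_rate = float(max_rate)   # bytes per second
        self.tokens = 0.0                 # bytes we may write right now
        self.last = time.monotonic()

    def before_write(self, nbytes):
        now = time.monotonic()
        # Accrue tokens for elapsed time, capped at one second's worth.
        self.tokens = min(self.max_rate,
                          self.tokens + (now - self.last) * self.max_rate)
        self.last = now
        if nbytes > self.tokens:
            # Not enough budget: sleep until the deficit is earned back.
            # This is the back-pressure the page cache itself never applies.
            time.sleep((nbytes - self.tokens) / self.max_rate)
            self.tokens = 0.0
            self.last = time.monotonic()
        else:
            self.tokens -= nbytes

throttle = BufferedWriteThrottle(max_rate=1 * 1024 * 1024)  # 1 MB/s cap
throttle.before_write(512 * 1024)   # first write: sleeps ~0.5s (empty bucket)
throttle.before_write(4096)         # small follow-up write
```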

Fairness with-in group
======================
One of the issues with a higher level controller is how to do fair
throttling so that fairness within the group is not impacted, especially
making sure that we don't break the notion of ioprio of the processes
within the group.

The io throttling patch in particular was very bad in terms of prio within
a group: throttling treated everyone equally and the difference between
process prios disappeared.

Reads Vs Writes
===============
A higher level controller will most likely change the ratio in which reads
and writes are dispatched to disk within a group. So far this has been
decided by the IO scheduler, but with higher level groups doing throttling
and possibly buffering the bios and releasing them later, they will have to
come up with their own policy on what proportion of reads and writes should
be dispatched. In the case of IO scheduler based control, all the queuing
takes place at the IO scheduler and it retains control of the ratio in
which reads and writes are dispatched.


Summary
=======

- An IO scheduler based IO controller can provide better latencies,
  stronger isolation between groups, time based fairness, and will not
  interfere with IO scheduler policies like class, ioprio and
  reader vs writer handling.

  But it cannot guarantee fairness at higher level logical devices.
  Especially in the case of max bw control, leaf node control does not
  seem to be the most appropriate thing.

- IO throttling provides max bw control in terms of absolute rate. It has
  the advantage that it can provide control at a higher level logical
  device and can also control buffered writes without needing an additional
  co-mounted controller.

  But it does only max bw control, not proportional control, so one might
  not be using resources optimally. It loses the sense of task prio and
  class within a group, as any task in the group can be throttled. Because
  throttling does not kick in until you hit the max bw limit, it will find
  it hard to provide the same latencies as IO scheduler based control.

- dm-ioband also has the advantage that it can provide fairness at higher
  level logical devices.

  But fairness is provided only in terms of size of IO or number of IOs;
  there is no time based fairness. It is very throughput oriented and does
  not throttle a high speed group if another group is running a slow random
  reader. This results in bad latencies for the random reader group and
  weaker isolation between groups.

  Also, it does not provide fairness if a group is not continuously
  backlogged. So if one is running 1-2 dd/sequential readers in a group,
  one does not get fairness until the workload is increased to the point
  where the group becomes continuously backlogged. This also results in
  poor latencies and limited fairness.

At this point of time it does not look like a single IO controller can
cover all the scenarios/requirements. This means a few things to me.

- Drop some of the requirements and go with one implementation which meets
  those reduced set of requirements.

- Have more than one IO controller implementation in the kernel: one doing
  lower level control for better latencies, stronger isolation and optimal
  resource usage, and another providing fairness at higher level logical
  devices and max bandwidth control.

  And let user decide which one to use based on his/her needs. 

- Come up with more intelligent way of doing IO control where single
  controller covers all the cases.

At this point of time, I am more inclined towards option 2 of having more
than one implementation in the kernel. :-) (Until and unless we can
brainstorm and come up with ideas to make option 3 happen).

> 
> It would be great if we discuss our plans on the mailing list, so we
> can get early feedback from everyone.
 
This is what comes to my mind so far. Please add to the list if I have missed
some points. Also correct me if I am wrong about the pros/cons of the
approaches.

Thoughts/ideas/opinions are welcome...

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found] ` <1253820332-10246-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-24 21:33   ` Andrew Morton
  2009-09-25  2:20   ` Ulrich Lukas
@ 2009-09-29  0:37   ` Nauman Rafique
  2 siblings, 0 replies; 349+ messages in thread
From: Nauman Rafique @ 2009-09-29  0:37 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,
Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
Jens about IO controller during Linux Plumbers Conference '09. Jens
expressed his concerns about the size and complexity of the patches. I
believe that is a reasonable concern. We talked about things that
could be done to reduce the size of the patches. The requirement that
the "solution has to work with all IO schedulers" seems like a
secondary concern at this point; and it came out as one thing that can
help to reduce the size of the patch set. Another possibility is to
use a simpler scheduling algorithm e.g. weighted round robin, instead
of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
the fact that it is complex to understand, and might be cumbersome to
maintain. Also, hierarchical scheduling is something that could be
unnecessary in the first set of patches, even though cgroups are
hierarchical in nature.

We are starting from a point where there is no cgroup based IO
scheduling in the kernel. And it is probably not reasonable to satisfy
all IO scheduling related requirements in one patch set. We can start
with something simple, and build on top of that. So a very simple
patch set that enables cgroup based proportional scheduling for CFQ
seems like the way to go at this point.

It would be great if we discuss our plans on the mailing list, so we
can get early feedback from everyone.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]               ` <20090928181846.GC3643-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-28 18:53                 ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-28 18:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	Corrado Zoccolo, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, 2009-09-28 at 14:18 -0400, Vivek Goyal wrote:
> On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:

> I guess changing the class to IDLE should have helped a bit, as this is now
> equivalent to setting the quantum to 1 and, after dispatching one request
> to disk, CFQ will always expire the writer once. So it might happen that by
> the time the reader preempted the writer, we have fewer requests in the
> disk and less latency for this reader.

I expected SCHED_IDLE to be better than setting quantum to 1, because
max is quantum*4 if you aren't IDLE.  But that's not what happened.  I
just retested with all knobs set back to stock, fairness off, and
quantum set to 1 with everything running nice 0.  2.8 seconds avg :-/

> > I saw
> > the reference to Vivek's patch, and gave it a shot.  Makes a large
> > difference.
> >                                                            Avg
> > perf stat     12.82     7.19     8.49     5.76     9.32    8.7     anticipatory
> >               16.24   175.82   154.38   228.97   147.16  144.5     noop
> >               43.23    57.39    96.13   148.25   180.09  105.0     deadline
> >                9.15    14.51     9.39    15.06     9.90   11.6     cfq fairness=0 dd=nice 0
> >               12.22     9.85    12.55     9.88    15.06   11.9     cfq fairness=0 dd=nice 19
> >                9.77    13.19    11.78    17.40     9.51   11.9     cfq fairness=0 dd=SCHED_IDLE
> >                4.59     2.74     4.70     3.45     4.69    4.0     cfq fairness=1 dd=nice 0
> >                3.79     4.66     2.66     5.15     3.03    3.8     cfq fairness=1 dd=nice 19
> >                2.79     4.73     2.79     4.02     2.50    3.3     cfq fairness=1 dd=SCHED_IDLE
> > 
> 
> Hmm.., looks like average latency went down only in  case of fairness=1
> and not in case of fairness=0. (Looking at previous mail, average vanilla
> cfq latencies were around 12 seconds).

Yup.

> Are you running all this in root group or have you put writers and readers
> into separate cgroups?

No cgroups here.

> If everything is running in root group, then I am curious why latency went
> down in case of fairness=1. The only thing fairness=1 parameter does is
> that it lets complete all the requests from previous queue before start
> dispatching from next queue. On top of this is valid only if no preemption
> took place. In your test case, konsole should preempt the writer so
> practically fairness=1 might not make much difference.

fairness=1 very definitely makes a very large difference.  All of those
cfq numbers were logged in back to back runs.

> In fact now Jens has committed a patch which achieves the similar effect as
> fairness=1 for async queues.

Yeah, I was there yesterday.  I speculated that that would hurt my
reader, but rearranging things didn't help one bit.  Playing with merge,
I managed to give dd ~7% more throughput, and injured poor reader even
more.  (problem analysis via hammer/axe not always most effective;)

> commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
> Author: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> Date:   Fri Jul 3 12:57:48 2009 +0200
> 
>     cfq-iosched: drain device queue before switching to a sync queue
>     
>     To lessen the impact of async IO on sync IO, let the device drain of
>     any async IO in progress when switching to a sync cfqq that has idling
>     enabled.
> 
> 
> If everything is in separate cgroups, then we should have seen latency
> improvements in the fairness=0 case also. I am a little perplexed here..
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]                 ` <20090928174809.GB3643-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-28 18:24                   ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-28 18:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, 2009-09-28 at 13:48 -0400, Vivek Goyal wrote:

> Hmm.., so close to 25% reduction on average in completion time of konsole.
> But this is in presence of a writer. Does this help even in presence of 1 or
> more sequential readers going?

Dunno, I've only tested sequential writer.

> So here latency seems to be coming from three sources.
> 
> - Wait in CFQ before request is dispatched (only in case of competing seq readers).
> - seek latencies
> - latencies because of bigger requests are already dispatched to disk.
> 
> So limiting the size of request will help with third factor but not with first  
> two factors and here seek latencies seem to be the biggest contributor.

Yeah, seek latency seems to dominate.

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]             ` <1254160274.9820.25.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-09-28 18:18               ` Vivek Goyal
  2009-09-29  5:55               ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-28 18:18 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	Corrado Zoccolo, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:
> On Mon, 2009-09-28 at 17:35 +0200, Corrado Zoccolo wrote:
> 
> > Great.
> > Can you try the attached patch (on top of 2.6.31)?
> > It implements the alternative approach we discussed privately in july,
> > and it addresses the possible latency increase that could happen with
> > your patch.
> > 
> > To summarize for everyone, we separate sync sequential queues, sync
> > seeky queues and async queues in three separate RR structures, and
> > alternate servicing requests between them.
> > 
> > When servicing seeky queues (the ones that are usually penalized by
> > cfq, for which no fairness is usually provided), we do not idle
> > between them, but we do idle for the last queue (the idle can be
> > exited when any seeky queue has requests). This allows us to allocate
> > disk time globally for all seeky processes, and to reduce seeky
> > processes latencies.
> > 
> > I tested with 'konsole -e exit', while doing a sequential write with
> > dd, and the start up time reduced from 37s to 7s, on an old laptop
> > disk.
> 
> I was fiddling around trying to get IDLE class to behave at least, and
> getting a bit frustrated.  Class/priority didn't seem to make much if
> any difference for konsole -e exit timings, and now I know why.

You seem to be testing konsole timings against a writer. In the case of a
writer, prio will not make much of a difference, as prio only adjusts the
length of the slice given to a process, and writers rarely get to use their
slice length. The reader immediately preempts it...

I guess changing the class to IDLE should have helped a bit, as this is now
equivalent to setting the quantum to 1 and, after dispatching one request
to disk, CFQ will always expire the writer once. So it might happen that by
the time the reader preempted the writer, we have fewer requests in the
disk and less latency for this reader.

> I saw
> the reference to Vivek's patch, and gave it a shot.  Makes a large
> difference.
>                                                            Avg
> perf stat     12.82     7.19     8.49     5.76     9.32    8.7     anticipatory
>               16.24   175.82   154.38   228.97   147.16  144.5     noop
>               43.23    57.39    96.13   148.25   180.09  105.0     deadline
>                9.15    14.51     9.39    15.06     9.90   11.6     cfq fairness=0 dd=nice 0
>               12.22     9.85    12.55     9.88    15.06   11.9     cfq fairness=0 dd=nice 19
>                9.77    13.19    11.78    17.40     9.51   11.9     cfq fairness=0 dd=SCHED_IDLE
>                4.59     2.74     4.70     3.45     4.69    4.0     cfq fairness=1 dd=nice 0
>                3.79     4.66     2.66     5.15     3.03    3.8     cfq fairness=1 dd=nice 19
>                2.79     4.73     2.79     4.02     2.50    3.3     cfq fairness=1 dd=SCHED_IDLE
> 

Hmm.., looks like average latency went down only in the case of fairness=1
and not in the case of fairness=0. (Looking at the previous mail, average
vanilla cfq latencies were around 12 seconds).

Are you running all this in root group or have you put writers and readers
into separate cgroups?

If everything is running in the root group, then I am curious why latency
went down in the case of fairness=1. The only thing the fairness=1
parameter does is let all the requests from the previous queue complete
before dispatching from the next queue starts. On top of that, this is
valid only if no preemption took place. In your test case, konsole should
preempt the writer, so practically fairness=1 might not make much
difference.

In fact now Jens has committed a patch which achieves a similar effect as
fairness=1 for async queues.
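To make the fairness=1 dispatch rule concrete, here is a rough sketch of
the semantics described above (an illustration only, not CFQ's actual code;
the function and parameter names are made up):

```python
def may_dispatch_from_new_queue(fairness, prev_queue_in_flight, preempting):
    """fairness=1 semantics (sketch): before dispatching from the next
    queue, wait for all requests already sent to the device from the
    previous queue to complete -- unless the switch is a preemption,
    which takes effect immediately either way."""
    if preempting:
        return True
    if fairness and prev_queue_in_flight > 0:
        return False   # drain the old queue's in-flight requests first
    return True

print(may_dispatch_from_new_queue(0, 3, False))  # fairness=0: interleave
print(may_dispatch_from_new_queue(1, 3, False))  # fairness=1: drain first
print(may_dispatch_from_new_queue(1, 3, True))   # preemption wins either way
```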

commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
Author: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Date:   Fri Jul 3 12:57:48 2009 +0200

    cfq-iosched: drain device queue before switching to a sync queue
    
    To lessen the impact of async IO on sync IO, let the device drain of
    any async IO in progress when switching to a sync cfqq that has idling
    enabled.


If everything is in separate cgroups, then we should have seen latency
improvements in the fairness=0 case also. I am a little perplexed here..

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-28 17:51           ` Mike Galbraith
@ 2009-09-28 18:18               ` Vivek Goyal
  2009-09-29  5:55             ` Mike Galbraith
       [not found]             ` <1254160274.9820.25.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-28 18:18 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Corrado Zoccolo, Ulrich Lukas, linux-kernel, containers,
	dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe,
	Tobias Oetiker

On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:
> On Mon, 2009-09-28 at 17:35 +0200, Corrado Zoccolo wrote:
> 
> > Great.
> > Can you try the attached patch (on top of 2.6.31)?
> > It implements the alternative approach we discussed privately in july,
> > and it addresses the possible latency increase that could happen with
> > your patch.
> > 
> > To summarize for everyone, we separate sync sequential queues, sync
> > seeky queues and async queues in three separate RR strucutres, and
> > alternate servicing requests between them.
> > 
> > When servicing seeky queues (the ones that are usually penalized by
> > cfq, for which no fairness is usually provided), we do not idle
> > between them, but we do idle for the last queue (the idle can be
> > exited when any seeky queue has requests). This allows us to allocate
> > disk time globally for all seeky processes, and to reduce seeky
> > processes latencies.
> > 
> > I tested with 'konsole -e exit', while doing a sequential write with
> > dd, and the start up time reduced from 37s to 7s, on an old laptop
> > disk.
> 
> I was fiddling around trying to get IDLE class to behave at least, and
> getting a bit frustrated.  Class/priority didn't seem to make much if
> any difference for konsole -e exit timings, and now I know why.

You seem to be testing konsole timings against a writer. In the case of a
writer, prio will not make much of a difference, as prio only adjusts the
length of the slice given to a process, and writers rarely get to use their
full slice length. The reader immediately preempts it...

I guess changing the class to IDLE should have helped a bit, as now this is
equivalent to setting the quantum to 1: after dispatching one request
to disk, CFQ will always expire the writer once. So it might happen that
by the time the reader preempted the writer, there were fewer requests in
the disk, and hence lower latency for this reader.

> I saw
> the reference to Vivek's patch, and gave it a shot.  Makes a large
> difference.
>                                                            Avg
> perf stat     12.82     7.19     8.49     5.76     9.32    8.7     anticipatory
>               16.24   175.82   154.38   228.97   147.16  144.5     noop
>               43.23    57.39    96.13   148.25   180.09  105.0     deadline
>                9.15    14.51     9.39    15.06     9.90   11.6     cfq fairness=0 dd=nice 0
>               12.22     9.85    12.55     9.88    15.06   11.9     cfq fairness=0 dd=nice 19
>                9.77    13.19    11.78    17.40     9.51   11.9     cfq fairness=0 dd=SCHED_IDLE
>                4.59     2.74     4.70     3.45     4.69    4.0     cfq fairness=1 dd=nice 0
>                3.79     4.66     2.66     5.15     3.03    3.8     cfq fairness=1 dd=nice 19
>                2.79     4.73     2.79     4.02     2.50    3.3     cfq fairness=1 dd=SCHED_IDLE
> 

Hmm.., looks like average latency went down only in the case of fairness=1
and not in the case of fairness=0. (Looking at the previous mail, average
vanilla cfq latencies were around 12 seconds.)

Are you running all this in root group or have you put writers and readers
into separate cgroups?

If everything is running in the root group, then I am curious why latency went
down in the case of fairness=1. The only thing the fairness=1 parameter does is
let all the requests from the previous queue complete before starting to
dispatch from the next queue. On top of that, this is valid only if no
preemption took place. In your test case, konsole should preempt the writer,
so practically fairness=1 might not make much difference.

In fact, Jens has now committed a patch which achieves a similar effect as
fairness=1 for async queues.

commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Fri Jul 3 12:57:48 2009 +0200

    cfq-iosched: drain device queue before switching to a sync queue
    
    To lessen the impact of async IO on sync IO, let the device drain of
    any async IO in progress when switching to a sync cfqq that has idling
    enabled.


If everything is in separate cgroups, then we should have seen latency
improvements in the fairness=0 case also. I am a little perplexed here...

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]           ` <4e5e476b0909280835w3410d58aod93a29d1dcda8909-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2009-09-28 17:14             ` Vivek Goyal
@ 2009-09-28 17:51             ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-28 17:51 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, 2009-09-28 at 17:35 +0200, Corrado Zoccolo wrote:

> Great.
> Can you try the attached patch (on top of 2.6.31)?
> It implements the alternative approach we discussed privately in July,
> and it addresses the possible latency increase that could happen with
> your patch.
> 
> To summarize for everyone, we separate sync sequential queues, sync
> seeky queues and async queues in three separate RR structures, and
> alternate servicing requests between them.
> 
> When servicing seeky queues (the ones that are usually penalized by
> cfq, for which no fairness is usually provided), we do not idle
> between them, but we do idle for the last queue (the idle can be
> exited when any seeky queue has requests). This allows us to allocate
> disk time globally for all seeky processes, and to reduce seeky
> processes latencies.
> 
> I tested with 'konsole -e exit', while doing a sequential write with
> dd, and the start up time reduced from 37s to 7s, on an old laptop
> disk.

I was fiddling around trying to get IDLE class to behave at least, and
getting a bit frustrated.  Class/priority didn't seem to make much if
any difference for konsole -e exit timings, and now I know why.  I saw
the reference to Vivek's patch, and gave it a shot.  Makes a large
difference.
                                                           Avg
perf stat     12.82     7.19     8.49     5.76     9.32    8.7     anticipatory
              16.24   175.82   154.38   228.97   147.16  144.5     noop
              43.23    57.39    96.13   148.25   180.09  105.0     deadline
               9.15    14.51     9.39    15.06     9.90   11.6     cfq fairness=0 dd=nice 0
              12.22     9.85    12.55     9.88    15.06   11.9     cfq fairness=0 dd=nice 19
               9.77    13.19    11.78    17.40     9.51   11.9     cfq fairness=0 dd=SCHED_IDLE
               4.59     2.74     4.70     3.45     4.69    4.0     cfq fairness=1 dd=nice 0
               3.79     4.66     2.66     5.15     3.03    3.8     cfq fairness=1 dd=nice 19
               2.79     4.73     2.79     4.02     2.50    3.3     cfq fairness=1 dd=SCHED_IDLE

I'll give your patch a spin as well.

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]               ` <1254110648.7683.3.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-09-28  5:55                 ` Mike Galbraith
@ 2009-09-28 17:48                 ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-28 17:48 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Sep 28, 2009 at 06:04:08AM +0200, Mike Galbraith wrote:
> On Sun, 2009-09-27 at 20:16 +0200, Mike Galbraith wrote:
> > On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:
> 
> > I'll give it a shot first thing in the A.M.
> 
> > > diff --git a/block/elevator.c b/block/elevator.c
> > > index 1975b61..d00a72b 100644
> > > --- a/block/elevator.c
> > > +++ b/block/elevator.c
> > > @@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
> > >  	 * See if our hash lookup can find a potential backmerge.
> > >  	 */
> > >  	__rq = elv_rqhash_find(q, bio->bi_sector);
> > > -	if (__rq && elv_rq_merge_ok(__rq, bio)) {
> > > -		*req = __rq;
> > > -		return ELEVATOR_BACK_MERGE;
> > > +	if (__rq) {
> > > +		/*
> > > +		 * If requests are queued behind this one, disallow merge. This
> > > +		 * prevents streaming IO from continually passing new IO.
> > > +		 */
> > > +		if (elv_latter_request(q, __rq))
> > > +			return ELEVATOR_NO_MERGE;
> > > +		if (elv_rq_merge_ok(__rq, bio)) {
> > > +			*req = __rq;
> > > +			return ELEVATOR_BACK_MERGE;
> > > +		}
> > >  	}
> > >  
> > >  	if (e->ops->elevator_merge_fn)
> 
> - = virgin tip v2.6.31-10215-ga3c9602
> + = with patchlet
>                                                             Avg
> dd pre         67.4     70.9     65.4     68.9     66.2     67.7-
>                65.9     68.5     69.8     65.2     65.8     67.0-     Avg
>                70.4     70.3     65.1     66.4     70.1     68.4-     67.7-
>                73.1     64.6     65.3     65.3     64.9     66.6+     65.6+     .968
>                63.8     67.9     65.2     65.1     64.4     65.2+
>                64.9     66.3     64.1     65.2     64.8     65.0+
> perf stat      8.66    16.29     9.65    14.88     9.45     11.7-
>               15.36     9.71    15.47    10.44    12.93     12.7-
>               10.55    15.11    10.22    15.35    10.32     12.3-     12.2-
>                9.87     7.53    10.62     7.51     9.95      9.0+      9.1+     .745
>                7.73    10.12     8.19    11.87     8.07      9.1+
>               11.04     7.62    10.14     8.13    10.23      9.4+
> dd post        63.4     60.5     66.7     64.5     67.3     64.4-
>                64.4     66.8     64.3     61.5     62.0     63.8-
>                63.8     64.9     66.2     65.6     66.9     65.4-     64.5-
>                60.9     63.4     60.2     63.4     65.5     62.6+     61.8+     .958
>                63.3     59.9     61.9     62.7     61.2     61.8+
>                60.1     63.7     59.5     61.5     60.6     61.0+
> 

Hmm.., so close to a 25% reduction on average in the completion time of
konsole. But this is in the presence of a writer. Does this help even in the
presence of 1 or more sequential readers going?

So here latency seems to be coming from three sources.

- Wait in CFQ before the request is dispatched (only in the case of competing seq readers).
- Seek latencies.
- Latencies because bigger requests have already been dispatched to the disk.

So limiting the size of requests will help with the third factor, but not with
the first two, and here seek latencies seem to be the biggest contributor.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]           ` <4e5e476b0909280835w3410d58aod93a29d1dcda8909-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-09-28 17:14             ` Vivek Goyal
  2009-09-28 17:51             ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-28 17:14 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Mon, Sep 28, 2009 at 05:35:02PM +0200, Corrado Zoccolo wrote:
> On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> >> >> Vivek Goyal wrote:
> >> >> > Notes:
> >> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
> >> >> >   Bring down its throughput and bump up latencies significantly.
> >> >>
> >> >>
> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> >> >> too.
> >> >>
> >> >> I'm basing this assumption on the observations I made on both OpenSuse
> >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> >> >> titled: "Poor desktop responsiveness with background I/O-operations" of
> >> >> 2009-09-20.
> >> >> (Message ID: 4AB59CBB.8090907-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org)
> >> >>
> >> >>
> >> >> Thus, I'm posting this to show that your work is greatly appreciated,
> >> >> given the rather disappointig status quo of Linux's fairness when it
> >> >> comes to disk IO time.
> >> >>
> >> >> I hope that your efforts lead to a change in performance of current
> >> >> userland applications, the sooner, the better.
> >> >>
> >> > [Please don't remove people from original CC list. I am putting them back.]
> >> >
> >> > Hi Ulrich,
> >> >
> >> > I quickly went through that mail thread and tried the following on my
> >> > desktop.
> >> >
> >> > ##########################################
> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
> >> > sleep 5
> >> > time firefox
> >> > # close firefox once gui pops up.
> >> > ##########################################
> >> >
> >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
> >> > following.
> >> >
> >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
> >> >
> >> > (Results do vary across runs, especially if system is booted fresh. Don't
> >> >  know why...).
> >> >
> >> >
> >> > Then I tried putting both the applications in separate groups and assign
> >> > them weights 200 each.
> >> >
> >> > ##########################################
> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
> >> > echo $! > /cgroup/io/test1/tasks
> >> > sleep 5
> >> > echo $$ > /cgroup/io/test2/tasks
> >> > time firefox
> >> > # close firefox once gui pops up.
> >> > ##########################################
> >> >
> >> > Now firefox pops up in 27 seconds. So it cut the time down by 2/3.
> >> >
> >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
> >> >
> >> > Notice that throughput of dd also improved.
> >> >
> >> > I ran the block trace and noticed that in many cases firefox threads
> >> > immediately preempted the "dd", probably because it was a file system
> >> > request. So in this case latency will arise from seek time.
> >> >
> >> > In some other cases, threads had to wait for up to 100ms because dd was
> >> > not preempted. In this case latency will arise both from waiting in the
> >> > queue and from seek time.
> >>
> >> I think cfq should already be doing something similar, i.e. giving
> >> 100ms slices to firefox, that alternate with dd, unless:
> >> * firefox is too seeky (in this case, the idle window will be too small)
> >> * firefox has too much think time.
> >>
> >
> Hi Vivek,
> > Hi Corrado,
> >
> > "firefox" is the shell script that sets up the environment and launches
> > the browser. It seems to be a group of threads. Some of them run in
> > parallel, and some of them seem to run one after the other (once the
> > previous process or thread has finished).
> 
> Ok.
> 
> >
> >> To rule out the first case, what happens if you run the test with your
> >> "fairness for seeky processes" patch?
> >
> > I applied that patch and it helps a lot.
> >
> > http://lwn.net/Articles/341032/
> >
> > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.
> 
> Great.
> Can you try the attached patch (on top of 2.6.31)?
> It implements the alternative approach we discussed privately in July,
> and it addresses the possible latency increase that could happen with
> your patch.
> 
> To summarize for everyone, we separate sync sequential queues, sync
> seeky queues and async queues in three separate RR structures, and
> alternate servicing requests between them.
> 
> When servicing seeky queues (the ones that are usually penalized by
> cfq, for which no fairness is usually provided), we do not idle
> between them, but we do idle for the last queue (the idle can be
> exited when any seeky queue has requests). This allows us to allocate
> disk time globally for all seeky processes, and to reduce seeky
> processes latencies.
> 

Ok, I seem to be doing the same thing at the group level (in the group
scheduling patches). I do not idle on individual sync seeky queues, but if
this is the last queue in the group, then I do idle to make sure the group
does not lose its fair share, and I exit from idle the moment there is any
busy queue in the group.

So you seem to be grouping all the sync seeky queues system-wide in a
single group. So all the sync seeky queues collectively get 100ms in a
single round of dispatch? I am wondering what happens if there are a lot
of such sync seeky queues and the 100ms time slice is consumed before all
the sync seeky queues have had a chance to dispatch. Does that mean that
some of the queues can completely skip one dispatch round?

Thanks
Vivek
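The arithmetic behind the question above can be made concrete with a toy calculation (not from either patch; the ~8 ms per-dispatch cost is an assumed typical rotational-disk seek time):

```python
def queues_served(slice_ms=100, seek_ms=8, nr_queues=20):
    """If all sync seeky queues share one slice and each dispatch costs
    roughly one seek, only slice_ms // seek_ms queues fit per round;
    the rest must wait for a later round."""
    served = min(nr_queues, slice_ms // seek_ms)
    return served, nr_queues - served

print(queues_served(nr_queues=20))
# -> (12, 8): with 20 seeky queues, 8 are skipped this round
```

So with a 100 ms aggregate slice and ~8 ms per seek, roughly a dozen seeky queues can be served per round, and any beyond that would indeed miss the round.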

> I tested with 'konsole -e exit', while doing a sequential write with
> dd, and the start up time reduced from 37s to 7s, on an old laptop
> disk.
> 
> Thanks,
> Corrado
> 
> >
> >> To rule out the first case, what happens if you run the test with your
> >> "fairness for seeky processes" patch?
> >
> > I applied that patch and it helps a lot.
> >
> > http://lwn.net/Articles/341032/
> >
> > With above patchset applied, and fairness=1, firefox pops up in 27-28
> > seconds.
> >
> > So it looks like not disabling the idle window for seeky processes on
> > hardware supporting command queuing helps in this particular case.
> >
> > Thanks
> > Vivek
> >

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-28 15:35         ` Corrado Zoccolo
@ 2009-09-28 17:14             ` Vivek Goyal
  2009-09-28 17:51           ` Mike Galbraith
       [not found]           ` <4e5e476b0909280835w3410d58aod93a29d1dcda8909-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-28 17:14 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, mingo, riel, jens.axboe, Tobias Oetiker

On Mon, Sep 28, 2009 at 05:35:02PM +0200, Corrado Zoccolo wrote:
> On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> >> >> Vivek Goyal wrote:
> >> >> > Notes:
> >> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
> >> >> >   Bring down its throughput and bump up latencies significantly.
> >> >>
> >> >>
> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> >> >> too.
> >> >>
> >> >> I'm basing this assumption on the observations I made on both OpenSuse
> >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> >> >> titled: "Poor desktop responsiveness with background I/O-operations" of
> >> >> 2009-09-20.
> >> >> (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
> >> >>
> >> >>
> >> >> Thus, I'm posting this to show that your work is greatly appreciated,
> >> >> given the rather disappointing status quo of Linux's fairness when it
> >> >> comes to disk IO time.
> >> >>
> >> >> I hope that your efforts lead to a change in performance of current
> >> >> userland applications, the sooner, the better.
> >> >>
> >> > [Please don't remove people from original CC list. I am putting them back.]
> >> >
> >> > Hi Ulrich,
> >> >
> >> > I quickly went through that mail thread and tried the following on my
> >> > desktop.
> >> >
> >> > ##########################################
> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
> >> > sleep 5
> >> > time firefox
> >> > # close firefox once gui pops up.
> >> > ##########################################
> >> >
> >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
> >> > following.
> >> >
> >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
> >> >
> >> > (Results do vary across runs, especially if system is booted fresh. Don't
> >> >  know why...).
> >> >
> >> >
> >> > Then I tried putting both the applications in separate groups and
> >> > assigning them weights of 200 each.
> >> >
> >> > ##########################################
> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
> >> > echo $! > /cgroup/io/test1/tasks
> >> > sleep 5
> >> > echo $$ > /cgroup/io/test2/tasks
> >> > time firefox
> >> > # close firefox once gui pops up.
> >> > ##########################################
> >> >
> >> > Now firefox pops up in 27 seconds. So it cut the time down by 2/3.
> >> >
> >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
> >> >
> >> > Notice that throughput of dd also improved.
> >> >
> >> > I ran the block trace and noticed that in many cases firefox threads
> >> > immediately preempted the "dd", probably because it was a file system
> >> > request. So in this case latency will arise from seek time.
> >> >
> >> > In some other cases, threads had to wait for up to 100ms because dd was
> >> > not preempted. In this case latency will arise both from waiting on queue
> >> > as well as seek time.
> >>
> >> I think cfq should already be doing something similar, i.e. giving
> >> 100ms slices to firefox, that alternate with dd, unless:
> >> * firefox is too seeky (in this case, the idle window will be too small)
> >> * firefox has too much think time.
> >>
> >
> Hi Vivek,
> > Hi Corrado,
> >
> > "firefox" is the shell script to setup the environment and launch the
> > broser. It seems to be a group of threads. Some of them run in parallel
> > and some of these seems to be running one after the other (once previous
> > process or threads finished).
> 
> Ok.
> 
> >
> >> To rule out the first case, what happens if you run the test with your
> >> "fairness for seeky processes" patch?
> >
> > I applied that patch and it helps a lot.
> >
> > http://lwn.net/Articles/341032/
> >
> > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.
> 
> Great.
> Can you try the attached patch (on top of 2.6.31)?
> It implements the alternative approach we discussed privately in July,
> and it addresses the possible latency increase that could happen with
> your patch.
> 
> To summarize for everyone, we separate sync sequential queues, sync
> seeky queues and async queues in three separate RR structures, and
> alternate servicing requests between them.
> 
> When servicing seeky queues (the ones that are usually penalized by
> cfq, for which no fairness is usually provided), we do not idle
> between them, but we do idle for the last queue (the idle can be
> exited when any seeky queue has requests). This allows us to allocate
> disk time globally for all seeky processes, and to reduce seeky
> processes latencies.
> 

Ok, I seem to be doing the same thing at the group level (in the group
scheduling patches). I do not idle on individual sync seeky queues, but if
this is the last queue in the group, then I do idle to make sure the group
does not lose its fair share, and I exit from idle the moment there is any
busy queue in the group.

So you seem to be grouping all the sync seeky queues system-wide in a
single group, and all the sync seeky queues collectively get 100ms in a
single round of dispatch? I am wondering what happens if there are a lot
of such sync seeky queues and the 100ms time slice is consumed before all
of them get a chance to dispatch. Does that mean that some of the queues
can completely skip a dispatch round?

Thanks
Vivek

> I tested with 'konsole -e exit', while doing a sequential write with
> dd, and the start up time reduced from 37s to 7s, on an old laptop
> disk.
> 
> Thanks,
> Corrado
> 
> >
> >> To rule out the first case, what happens if you run the test with your
> >> "fairness for seeky processes" patch?
> >
> > I applied that patch and it helps a lot.
> >
> > http://lwn.net/Articles/341032/
> >
> > With above patchset applied, and fairness=1, firefox pops up in 27-28
> > seconds.
> >
> > So it looks like not disabling the idle window for seeky processes on
> > hardware supporting command queuing helps in this particular case.
> >
> > Thanks
> > Vivek
> >



^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]         ` <20090928145655.GB8192-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-28 15:35           ` Corrado Zoccolo
  0 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-09-28 15:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

[-- Attachment #1: Type: text/plain, Size: 5325 bytes --]

On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
>> Hi Vivek,
>> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> >> Vivek Goyal wrote:
>> >> > Notes:
>> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >> >   Bring down its throughput and bump up latencies significantly.
>> >>
>> >>
>> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> >> too.
>> >>
>> >> I'm basing this assumption on the observations I made on both OpenSuse
>> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> >> titled: "Poor desktop responsiveness with background I/O-operations" of
>> >> 2009-09-20.
>> >> (Message ID: 4AB59CBB.8090907-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org)
>> >>
>> >>
>> >> Thus, I'm posting this to show that your work is greatly appreciated,
>> >> given the rather disappointing status quo of Linux's fairness when it
>> >> comes to disk IO time.
>> >>
>> >> I hope that your efforts lead to a change in performance of current
>> >> userland applications, the sooner, the better.
>> >>
>> > [Please don't remove people from original CC list. I am putting them back.]
>> >
>> > Hi Ulrich,
>> >
>> > I quickly went through that mail thread and tried the following on my
>> > desktop.
>> >
>> > ##########################################
>> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> > sleep 5
>> > time firefox
>> > # close firefox once gui pops up.
>> > ##########################################
>> >
>> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
>> > following.
>> >
>> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>> >
>> > (Results do vary across runs, especially if system is booted fresh. Don't
>> >  know why...).
>> >
>> >
>> > Then I tried putting both the applications in separate groups and
>> > assigning them weights of 200 each.
>> >
>> > ##########################################
>> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> > echo $! > /cgroup/io/test1/tasks
>> > sleep 5
>> > echo $$ > /cgroup/io/test2/tasks
>> > time firefox
>> > # close firefox once gui pops up.
>> > ##########################################
>> >
>> > Now firefox pops up in 27 seconds. So it cut the time down by 2/3.
>> >
>> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>> >
>> > Notice that throughput of dd also improved.
>> >
>> > I ran the block trace and noticed that in many cases firefox threads
>> > immediately preempted the "dd", probably because it was a file system
>> > request. So in this case latency will arise from seek time.
>> >
>> > In some other cases, threads had to wait for up to 100ms because dd was
>> > not preempted. In this case latency will arise both from waiting on queue
>> > as well as seek time.
>>
>> I think cfq should already be doing something similar, i.e. giving
>> 100ms slices to firefox, that alternate with dd, unless:
>> * firefox is too seeky (in this case, the idle window will be too small)
>> * firefox has too much think time.
>>
>
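As background for the "too seeky" case quoted above, the idle-window decision can be modeled roughly. This is a toy Python sketch, not kernel code: the seek threshold mirrors CIC_SEEK_THR from the attached patch (1024 sectors, lowered from 8 * 1024), while the 8 ms and 2 ms idle values assume HZ=1000 and the 2 ms seeky idle discussed elsewhere in this thread.

```python
CIC_SEEK_THR = 1024  # sectors; the attached patch lowers this from 8 * 1024

def is_seeky(seek_mean):
    """A queue whose mean seek distance exceeds the threshold is seeky."""
    return seek_mean > CIC_SEEK_THR

def idle_window_ms(seek_mean, slice_idle_ms=8, seeky_idle_ms=2):
    # seeky queues keep a short idle window rather than losing it entirely
    return seeky_idle_ms if is_seeky(seek_mean) else slice_idle_ms

print(idle_window_ms(128))    # sequential-ish queue -> 8
print(idle_window_ms(4096))   # seeky queue -> 2
```

The effect is that a seeky process like the firefox startup threads gets only a small idle window, which is why it can fail to hold the disk against a sequential dd.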
Hi Vivek,
> Hi Corrado,
>
> "firefox" is the shell script to setup the environment and launch the
> broser. It seems to be a group of threads. Some of them run in parallel
> and some of these seems to be running one after the other (once previous
> process or threads finished).

Ok.

>
>> To rule out the first case, what happens if you run the test with your
>> "fairness for seeky processes" patch?
>
> I applied that patch and it helps a lot.
>
> http://lwn.net/Articles/341032/
>
> With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.

Great.
Can you try the attached patch (on top of 2.6.31)?
It implements the alternative approach we discussed privately in July,
and it addresses the possible latency increase that could happen with
your patch.

To summarize for everyone, we separate sync sequential queues, sync
seeky queues and async queues in three separate RR structures, and
alternate servicing requests between them.

When servicing seeky queues (the ones that are usually penalized by
cfq, for which no fairness is usually provided), we do not idle
between them, but we do idle for the last queue (the idle can be
exited when any seeky queue has requests). This allows us to allocate
disk time globally for all seeky processes, and to reduce seeky
processes latencies.

I tested with 'konsole -e exit', while doing a sequential write with
dd, and the start up time reduced from 37s to 7s, on an old laptop
disk.

Thanks,
Corrado

>
>> To rule out the first case, what happens if you run the test with your
>> "fairness for seeky processes" patch?
>
> I applied that patch and it helps a lot.
>
> http://lwn.net/Articles/341032/
>
> With above patchset applied, and fairness=1, firefox pops up in 27-28
> seconds.
>
> So it looks like not disabling the idle window for seeky processes on
> hardware supporting command queuing helps in this particular case.
>
> Thanks
> Vivek
>

[-- Attachment #2: cfq.patch --]
[-- Type: application/octet-stream, Size: 24221 bytes --]

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fd7080e..064f4fb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -27,6 +27,12 @@ static const int cfq_slice_sync = HZ / 10;
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
+static int cfq_target_latency = HZ * 3/10; /* 300 ms */
+static int cfq_hist_divisor = 4;
+/*
+ * Number of times that other workloads can be scheduled before async
+ */
+static const unsigned int cfq_async_penalty = 4;
 
 /*
  * offset from end of service tree
@@ -36,7 +42,7 @@ static int cfq_slice_idle = HZ / 125;
 /*
  * below this threshold, we consider thinktime immediate
  */
-#define CFQ_MIN_TT		(2)
+#define CFQ_MIN_TT		(1)
 
 #define CFQ_SLICE_SCALE		(5)
 #define CFQ_HW_QUEUE_MIN	(5)
@@ -67,8 +73,9 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
 struct cfq_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
+	unsigned count;
 };
-#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, }
+#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, 0, }
 
 /*
  * Per process-grouping structure
@@ -113,6 +120,21 @@ struct cfq_queue {
 	unsigned short ioprio_class, org_ioprio_class;
 
 	pid_t pid;
+
+	struct cfq_rb_root *service_tree;
+	struct cfq_io_context *cic;
+};
+
+enum wl_prio_t {
+	IDLE_WL = -1,
+	BE_WL = 0,
+	RT_WL = 1
+};
+
+enum wl_type_t {
+	ASYNC_WL = 0,
+	SYNC_NOIDLE_WL = 1,
+	SYNC_WL = 2
 };
 
 /*
@@ -124,7 +146,13 @@ struct cfq_data {
 	/*
 	 * rr list of queues with requests and the count of them
 	 */
-	struct cfq_rb_root service_tree;
+	struct cfq_rb_root service_trees[2][3];
+	struct cfq_rb_root service_tree_idle;
+
+	enum wl_prio_t serving_prio;
+	enum wl_type_t serving_type;
+	unsigned long workload_expires;
+	unsigned int async_starved;
 
 	/*
 	 * Each priority tree is sorted by next_request position.  These
@@ -134,14 +162,11 @@ struct cfq_data {
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
 	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
+	unsigned int busy_queues_avg[2];
 
-	int rq_in_driver;
+	int rq_in_driver[2];
 	int sync_flight;
+	int reads_delayed;
 
 	/*
 	 * queue-depth detection
@@ -178,6 +203,9 @@ struct cfq_data {
 	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
+	unsigned int cfq_target_latency;
+	unsigned int cfq_hist_divisor;
+	unsigned int cfq_async_penalty;
 
 	struct list_head cic_list;
 
@@ -187,11 +215,15 @@ struct cfq_data {
 	struct cfq_queue oom_cfqq;
 };
 
+static struct cfq_rb_root * service_tree_for(enum wl_prio_t prio, enum wl_type_t type,
+							  struct cfq_data *cfqd) {
+	return prio == IDLE_WL ? &cfqd->service_tree_idle :  &cfqd->service_trees[prio][type];
+}
+
 enum cfqq_state_flags {
 	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
 	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
 	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
-	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
 	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
@@ -218,7 +250,6 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 CFQ_CFQQ_FNS(on_rr);
 CFQ_CFQQ_FNS(wait_request);
 CFQ_CFQQ_FNS(must_dispatch);
-CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
 CFQ_CFQQ_FNS(idle_window);
@@ -233,12 +264,28 @@ CFQ_CFQQ_FNS(coop);
 #define cfq_log(cfqd, fmt, args...)	\
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
+#define CIC_SEEK_THR	1024
+#define CIC_SEEKY(cic)	((cic)->seek_mean > CIC_SEEK_THR)
+#define CFQQ_SEEKY(cfqq) (!cfqq->cic || CIC_SEEKY(cfqq->cic))
+
+static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd) {
+	return wl==IDLE_WL? cfqd->service_tree_idle.count :
+		cfqd->service_trees[wl][ASYNC_WL].count
+		+ cfqd->service_trees[wl][SYNC_NOIDLE_WL].count
+		+ cfqd->service_trees[wl][SYNC_WL].count;
+}
+
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
 				       struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
+static inline int rq_in_driver(struct cfq_data *cfqd)
+{
+	return cfqd->rq_in_driver[0] + cfqd->rq_in_driver[1];
+}
+
 static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 					    int is_sync)
 {
@@ -249,6 +296,7 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
 				struct cfq_queue *cfqq, int is_sync)
 {
 	cic->cfqq[!!is_sync] = cfqq;
+	cfqq->cic = cic;
 }
 
 /*
@@ -257,7 +305,7 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
  */
 static inline int cfq_bio_sync(struct bio *bio)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
+	if (bio_data_dir(bio) == READ || bio_rw_flagged(bio, BIO_RW_SYNCIO))
 		return 1;
 
 	return 0;
@@ -303,10 +351,33 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
 }
 
+static inline unsigned
+cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) {
+	unsigned min_q, max_q;
+	unsigned mult  = cfqd->cfq_hist_divisor - 1;
+	unsigned round = cfqd->cfq_hist_divisor / 2;
+	unsigned busy  = cfq_busy_queues_wl(rt, cfqd);
+	min_q = min(cfqd->busy_queues_avg[rt], busy);
+	max_q = max(cfqd->busy_queues_avg[rt], busy);
+	cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
+		cfqd->cfq_hist_divisor;
+	return cfqd->busy_queues_avg[rt];
+}
+
 static inline void
 cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+	unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1];
+	unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq));
+	unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
+
+	if (iq > process_thr) {
+		unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle
+			/ cfqd->cfq_slice[1];
+		slice = max(slice * process_thr / iq, min(slice, low_slice));
+	}
+
+	cfqq->slice_end = jiffies + slice;
 	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
 }
 
@@ -445,6 +516,7 @@ static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
 	if (root->left == n)
 		root->left = NULL;
 	rb_erase_init(n, &root->rb);
+	--root->count;
 }
 
 /*
@@ -485,46 +557,56 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
 }
 
 /*
- * The cfqd->service_tree holds all pending cfq_queue's that have
+ * The cfqd->service_trees holds all pending cfq_queue's that have
  * requests waiting to be processed. It is sorted in the order that
  * we will service the queues.
  */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
+static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	struct rb_node **p, *parent;
 	struct cfq_queue *__cfqq;
 	unsigned long rb_key;
+	struct cfq_rb_root *service_tree;
 	int left;
 
 	if (cfq_class_idle(cfqq)) {
 		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
+		service_tree = &cfqd->service_tree_idle;
+		parent = rb_last(&service_tree->rb);
 		if (parent && parent != &cfqq->rb_node) {
 			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 			rb_key += __cfqq->rb_key;
 		} else
 			rb_key += jiffies;
-	} else if (!add_front) {
+	} else {
+		enum wl_prio_t prio = cfq_class_rt(cfqq) ? RT_WL : BE_WL;
+		enum wl_type_t type = cfq_cfqq_sync(cfqq) ? SYNC_WL : ASYNC_WL;
+
 		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
 		rb_key += cfqq->slice_resid;
 		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
+
+		if (type == SYNC_WL && (CFQQ_SEEKY(cfqq) || !cfq_cfqq_idle_window(cfqq)))
+			type = SYNC_NOIDLE_WL;
+
+		service_tree = service_tree_for(prio, type, cfqd);
+	}
 
 	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
 		/*
 		 * same position, nothing more to do
 		 */
-		if (rb_key == cfqq->rb_key)
+		if (rb_key == cfqq->rb_key && cfqq->service_tree == service_tree)
 			return;
 
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+		cfqq->service_tree = NULL;
 	}
 
 	left = 1;
 	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
+	cfqq->service_tree = service_tree;
+	p = &service_tree->rb.rb_node;
 	while (*p) {
 		struct rb_node **n;
 
@@ -556,11 +638,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	}
 
 	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
+		service_tree->left = &cfqq->rb_node;
 
 	cfqq->rb_key = rb_key;
 	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
+	rb_insert_color(&cfqq->rb_node, &service_tree->rb);
+	service_tree->count++;
 }
 
 static struct cfq_queue *
@@ -633,7 +716,7 @@ static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * Resorting requires the cfqq to be on the RR list already.
 	 */
 	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
+		cfq_service_tree_add(cfqd, cfqq);
 		cfq_prio_tree_add(cfqd, cfqq);
 	}
 }
@@ -648,8 +731,6 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	BUG_ON(cfq_cfqq_on_rr(cfqq));
 	cfq_mark_cfqq_on_rr(cfqq);
 	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
 
 	cfq_resort_rr_list(cfqd, cfqq);
 }
@@ -664,8 +745,10 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));
 	cfq_clear_cfqq_on_rr(cfqq);
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+		cfqq->service_tree = NULL;
+	}
 	if (cfqq->p_root) {
 		rb_erase(&cfqq->p_node, cfqq->p_root);
 		cfqq->p_root = NULL;
@@ -673,8 +756,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 
 	BUG_ON(!cfqd->busy_queues);
 	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
 }
 
 /*
@@ -760,9 +841,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
+	cfqd->rq_in_driver[rq_is_sync(rq)]++;
 	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
+						rq_in_driver(cfqd));
 
 	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
 }
@@ -770,11 +851,12 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
+	const int sync = rq_is_sync(rq);
 
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
+	WARN_ON(!cfqd->rq_in_driver[sync]);
+	cfqd->rq_in_driver[sync]--;
 	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
+						rq_in_driver(cfqd));
 }
 
 static void cfq_remove_request(struct request *rq)
@@ -928,10 +1010,11 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
  */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
+	struct cfq_rb_root *service_tree = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd);
 
-	return cfq_rb_first(&cfqd->service_tree);
+	if (RB_EMPTY_ROOT(&service_tree->rb))
+		return NULL;
+	return cfq_rb_first(service_tree);
 }
 
 /*
@@ -959,9 +1042,6 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
 		return cfqd->last_position - blk_rq_pos(rq);
 }
 
-#define CIC_SEEK_THR	8 * 1024
-#define CIC_SEEKY(cic)	((cic)->seek_mean > CIC_SEEK_THR)
-
 static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq)
 {
 	struct cfq_io_context *cic = cfqd->active_cic;
@@ -1049,6 +1129,10 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 	if (cfq_cfqq_coop(cfqq))
 		return NULL;
 
+	/* we don't want to mix processes with different characteristics */
+	if (cfqq->service_tree != cur_cfqq->service_tree)
+		return NULL;
+
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
 	return cfqq;
@@ -1080,7 +1164,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	/*
 	 * still requests with the driver, don't idle
 	 */
-	if (cfqd->rq_in_driver)
+	if (rq_in_driver(cfqd))
 		return;
 
 	/*
@@ -1092,14 +1176,15 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 
 	cfq_mark_cfqq_wait_request(cfqq);
 
-	/*
-	 * we don't want to idle for seeks, but we do want to allow
-	 * fair distribution of slice time for a process doing back-to-back
-	 * seeks. so allow a little bit of time for him to submit a new rq
-	 */
-	sl = cfqd->cfq_slice_idle;
-	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
+	sl = min_t(unsigned, cfqd->cfq_slice_idle, cfqq->slice_end - jiffies);
+
+	/* very small idle if we are serving noidle trees, and there are more trees */
+	if (cfqd->serving_type == SYNC_NOIDLE_WL &&
+	    service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count > 0) {
+		if (blk_queue_nonrot(cfqd->queue))
+			return;
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
+	}
 
 	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
@@ -1115,6 +1200,12 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
 
+	if (!time_before(jiffies, rq->start_time + cfqd->cfq_target_latency / 2) && rq_data_dir(rq)==READ) {
+		cfqd->reads_delayed = max_t(int, cfqd->reads_delayed,
+					    (jiffies - rq->start_time) / (cfqd->cfq_target_latency / 2));
+	}
+
+	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
 	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
@@ -1160,6 +1251,16 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
 }
 
+enum wl_type_t cfq_choose_sync_async(struct cfq_data *cfqd, enum wl_prio_t prio) {
+	struct cfq_queue *id, *ni;
+	ni = cfq_rb_first(service_tree_for(prio, SYNC_NOIDLE_WL, cfqd));
+	id = cfq_rb_first(service_tree_for(prio, SYNC_WL, cfqd));
+	if (id && ni && id->rb_key < ni->rb_key)
+		return SYNC_WL;
+	if (!ni) return SYNC_WL;
+	return SYNC_NOIDLE_WL;
+}
+
 /*
  * Select a queue for service. If we have a current active queue,
  * check whether to continue servicing it, or retrieve and set a new one.
@@ -1179,20 +1280,6 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 		goto expire;
 
 	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
 	 * The active queue has requests and isn't expired, allow it to
 	 * dispatch.
 	 */
@@ -1214,15 +1301,68 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	 * flight or is idling for a new request, allow either of these
 	 * conditions to happen (or time out) before selecting a new queue.
 	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
+	if (timer_pending(&cfqd->idle_slice_timer) || 
 	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
 		cfqq = NULL;
 		goto keep_queue;
 	}
-
 expire:
 	cfq_slice_expired(cfqd, 0);
 new_queue:
+	if (!new_cfqq) {
+		enum wl_prio_t previous_prio = cfqd->serving_prio;
+
+		if (cfq_busy_queues_wl(RT_WL, cfqd))
+			cfqd->serving_prio = RT_WL;
+		else if (cfq_busy_queues_wl(BE_WL, cfqd))
+			cfqd->serving_prio = BE_WL;
+		else {
+			cfqd->serving_prio = IDLE_WL;
+			cfqd->workload_expires = jiffies + 1;
+			cfqd->reads_delayed = 0;
+		}
+
+		if (cfqd->serving_prio != IDLE_WL) {
+			int counts[]={
+				service_tree_for(cfqd->serving_prio, ASYNC_WL, cfqd)->count,
+				service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count,
+				service_tree_for(cfqd->serving_prio, SYNC_WL, cfqd)->count
+			};
+			int nonzero_counts= !!counts[0] + !!counts[1] + !!counts[2];
+
+			if (previous_prio != cfqd->serving_prio || (nonzero_counts == 1)) {
+				cfqd->serving_type = counts[1] ? SYNC_NOIDLE_WL : counts[2] ? SYNC_WL : ASYNC_WL;
+				cfqd->async_starved = 0;
+				cfqd->reads_delayed = 0;
+			} else {
+				if (!counts[cfqd->serving_type] || time_after(jiffies, cfqd->workload_expires)) {
+					if (cfqd->serving_type != ASYNC_WL && counts[ASYNC_WL] &&
+					    cfqd->async_starved++ > cfqd->cfq_async_penalty * (1 + cfqd->reads_delayed))
+						cfqd->serving_type = ASYNC_WL;
+					else 
+						cfqd->serving_type = cfq_choose_sync_async(cfqd, cfqd->serving_prio);
+				} else
+					goto same_wl;
+			}
+
+			{
+				unsigned slice = cfqd->cfq_target_latency;
+				slice = slice * counts[cfqd->serving_type] /
+					max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio],
+					      counts[SYNC_WL] + counts[SYNC_NOIDLE_WL] + counts[ASYNC_WL]);
+					    
+				if (cfqd->serving_type == ASYNC_WL)
+					slice = max(1U, (slice / (1 + cfqd->reads_delayed))
+						    * cfqd->cfq_slice[0] / cfqd->cfq_slice[1]);
+				else
+					slice = max(slice, 2U * max(1U, cfqd->cfq_slice_idle));
+
+				cfqd->workload_expires = jiffies + slice;
+				cfqd->async_starved *= (cfqd->serving_type != ASYNC_WL);
+			}
+		}
+	}
+ same_wl:
 	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
 keep_queue:
 	return cfqq;
@@ -1249,8 +1389,13 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 {
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
+	int i,j;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j])) != NULL)
+				dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
 	cfq_slice_expired(cfqd, 0);
@@ -1312,6 +1457,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		return 0;
 
 	/*
+	 * Drain async requests before we start sync IO
+	 */
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+		return 0;
+
+	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
@@ -1362,7 +1513,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		cfq_slice_expired(cfqd, 0);
 	}
 
-	cfq_log(cfqd, "dispatched a request");
+	cfq_log_cfqq(cfqd, cfqq, "dispatched a request");
 	return 1;
 }
 
@@ -2004,18 +2155,8 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (cfq_class_idle(cfqq))
 		return 1;
 
-	/*
-	 * if the new request is sync, but the currently running queue is
-	 * not, let the sync request have priority.
-	 */
-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
-		return 1;
-
-	/*
-	 * So both queues are sync. Let the new request get disk time if
-	 * it's a metadata request and the current queue is doing regular IO.
-	 */
-	if (rq_is_meta(rq) && !cfqq->meta_pending)
+	if (cfqd->serving_type == SYNC_NOIDLE_WL
+	    && new_cfqq->service_tree == cfqq->service_tree)
 		return 1;
 
 	/*
@@ -2046,13 +2187,9 @@ static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
 	cfq_slice_expired(cfqd, 1);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));
 
-	cfq_service_tree_add(cfqd, cfqq, 1);
+	cfq_service_tree_add(cfqd, cfqq);
 
 	cfqq->slice_end = 0;
 	cfq_mark_cfqq_slice_new(cfqq);
@@ -2130,11 +2267,11 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_update_hw_tag(struct cfq_data *cfqd)
 {
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
+	if (rq_in_driver(cfqd) > cfqd->rq_in_driver_peak)
+		cfqd->rq_in_driver_peak = rq_in_driver(cfqd);
 
 	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
+	    rq_in_driver(cfqd) <= CFQ_HW_QUEUE_MIN)
 		return;
 
 	if (cfqd->hw_tag_samples++ < 50)
@@ -2161,9 +2298,9 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	cfq_update_hw_tag(cfqd);
 
-	WARN_ON(!cfqd->rq_in_driver);
+	WARN_ON(!cfqd->rq_in_driver[sync]);
 	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
+	cfqd->rq_in_driver[sync]--;
 	cfqq->dispatched--;
 
 	if (cfq_cfqq_sync(cfqq))
@@ -2197,7 +2334,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 			cfq_arm_slice_timer(cfqd);
 	}
 
-	if (!cfqd->rq_in_driver)
+	if (!rq_in_driver(cfqd))
 		cfq_schedule_dispatch(cfqd);
 }
 
@@ -2229,8 +2366,7 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if (cfq_cfqq_wait_request(cfqq) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2317,7 +2453,6 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	}
 
 	cfqq->allocated[rw]++;
-	cfq_clear_cfqq_must_alloc(cfqq);
 	atomic_inc(&cfqq->ref);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
@@ -2451,13 +2586,16 @@ static void cfq_exit_queue(struct elevator_queue *e)
 static void *cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
-	int i;
+	int i,j;
 
 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			cfqd->service_trees[i][j] = CFQ_RB_ROOT;
+	cfqd->service_tree_idle = CFQ_RB_ROOT;
 
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
@@ -2494,6 +2632,9 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
+	cfqd->cfq_target_latency = cfq_target_latency;
+	cfqd->cfq_hist_divisor = cfq_hist_divisor;
+	cfqd->cfq_async_penalty = cfq_async_penalty;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2530,6 +2671,7 @@ fail:
 /*
  * sysfs parts below -->
  */
+
 static ssize_t
 cfq_var_show(unsigned int var, char *page)
 {
@@ -2563,6 +2705,9 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
+SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
+SHOW_FUNCTION(cfq_hist_divisor_show, cfqd->cfq_hist_divisor, 0);
+SHOW_FUNCTION(cfq_async_penalty_show, cfqd->cfq_async_penalty, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2594,6 +2739,11 @@ STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
+
+STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 1000, 1);
+STORE_FUNCTION(cfq_hist_divisor_store, &cfqd->cfq_hist_divisor, 1, 100, 0);
+STORE_FUNCTION(cfq_async_penalty_store, &cfqd->cfq_async_penalty, 1, UINT_MAX, 0);
+
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2609,6 +2759,9 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	CFQ_ATTR(target_latency),
+	CFQ_ATTR(hist_divisor),
+	CFQ_ATTR(async_penalty),
 	__ATTR_NULL
 };
 

[-- Attachment #3: Type: text/plain, Size: 206 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-28 14:56         ` Vivek Goyal
  (?)
  (?)
@ 2009-09-28 15:35         ` Corrado Zoccolo
  2009-09-28 17:14             ` Vivek Goyal
                             ` (2 more replies)
  -1 siblings, 3 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-09-28 15:35 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, mingo, riel, jens.axboe, Tobias Oetiker

[-- Attachment #1: Type: text/plain, Size: 5235 bytes --]

On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
>> Hi Vivek,
>> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> >> Vivek Goyal wrote:
>> >> > Notes:
>> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >> >   Bring down its throughput and bump up latencies significantly.
>> >>
>> >>
>> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> >> too.
>> >>
>> >> I'm basing this assumption on the observations I made on both OpenSuse
>> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> >> titled: "Poor desktop responsiveness with background I/O-operations" of
>> >> 2009-09-20.
>> >> (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
>> >>
>> >>
>> >> Thus, I'm posting this to show that your work is greatly appreciated,
>> >> given the rather disappointig status quo of Linux's fairness when it
>> >> comes to disk IO time.
>> >>
>> >> I hope that your efforts lead to a change in performance of current
>> >> userland applications, the sooner, the better.
>> >>
>> > [Please don't remove people from original CC list. I am putting them back.]
>> >
>> > Hi Ulrich,
>> >
>> > I quicky went through that mail thread and I tried following on my
>> > desktop.
>> >
>> > ##########################################
>> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> > sleep 5
>> > time firefox
>> > # close firefox once gui pops up.
>> > ##########################################
>> >
>> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
>> > following.
>> >
>> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>> >
>> > (Results do vary across runs, especially if system is booted fresh. Don't
>> >  know why...).
>> >
>> >
>> > Then I tried putting both the applications in separate groups and assign
>> > them weights 200 each.
>> >
>> > ##########################################
>> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> > echo $! > /cgroup/io/test1/tasks
>> > sleep 5
>> > echo $$ > /cgroup/io/test2/tasks
>> > time firefox
>> > # close firefox once gui pops up.
>> > ##########################################
>> >
>> > Now firefox pops up in 27 seconds. So it cut down the time by 2/3.
>> >
>> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>> >
>> > Notice that throughput of dd also improved.
>> >
>> > I ran the block trace and noticed in many a cases firefox threads
>> > immediately preempted the "dd". Probably because it was a file system
>> > request. So in this case latency will arise from seek time.
>> >
>> > In some other cases, threads had to wait for up to 100ms because dd was
>> > not preempted. In this case latency will arise both from waiting on queue
>> > as well as seek time.
>>
>> I think cfq should already be doing something similar, i.e. giving
>> 100ms slices to firefox, that alternate with dd, unless:
>> * firefox is too seeky (in this case, the idle window will be too small)
>> * firefox has too much think time.
>>
>
Hi Vivek,
> Hi Corrado,
>
> "firefox" is the shell script to setup the environment and launch the
> broser. It seems to be a group of threads. Some of them run in parallel
> and some of these seems to be running one after the other (once previous
> process or threads finished).

Ok.

>
>> To rule out the first case, what happens if you run the test with your
>> "fairness for seeky processes" patch?
>
> I applied that patch and it helps a lot.
>
> http://lwn.net/Articles/341032/
>
> With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.

Great.
Can you try the attached patch (on top of 2.6.31)?
It implements the alternative approach we discussed privately in july,
and it addresses the possible latency increase that could happen with
your patch.

To summarize for everyone, we separate sync sequential queues, sync
seeky queues and async queues into three separate RR structures, and
alternate servicing requests among them.

When servicing seeky queues (the ones that are usually penalized by
cfq, since no fairness is normally provided for them), we do not idle
between them, but we do idle for the last one (the idle can be
exited as soon as any seeky queue has requests). This allows us to
allocate disk time globally across all seeky processes, and to reduce
seeky processes' latencies.

I tested with 'konsole -e exit', while doing a sequential write with
dd, and the start-up time dropped from 37s to 7s on an old laptop
disk.

Thanks,
Corrado

>
> So it looks like if we don't disable idle window for seeky processes on
> hardware supporting command queuing, it helps in this particular case.
>
> Thanks
> Vivek
>

[-- Attachment #2: cfq.patch --]
[-- Type: application/octet-stream, Size: 24221 bytes --]

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fd7080e..064f4fb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -27,6 +27,12 @@ static const int cfq_slice_sync = HZ / 10;
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
+static int cfq_target_latency = HZ * 3/10; /* 300 ms */
+static int cfq_hist_divisor = 4;
+/*
+ * Number of times that other workloads can be scheduled before async
+ */
+static const unsigned int cfq_async_penalty = 4;
 
 /*
  * offset from end of service tree
@@ -36,7 +42,7 @@ static int cfq_slice_idle = HZ / 125;
 /*
  * below this threshold, we consider thinktime immediate
  */
-#define CFQ_MIN_TT		(2)
+#define CFQ_MIN_TT		(1)
 
 #define CFQ_SLICE_SCALE		(5)
 #define CFQ_HW_QUEUE_MIN	(5)
@@ -67,8 +73,9 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
 struct cfq_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
+	unsigned count;
 };
-#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, }
+#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, 0, }
 
 /*
  * Per process-grouping structure
@@ -113,6 +120,21 @@ struct cfq_queue {
 	unsigned short ioprio_class, org_ioprio_class;
 
 	pid_t pid;
+
+	struct cfq_rb_root *service_tree;
+	struct cfq_io_context *cic;
+};
+
+enum wl_prio_t {
+	IDLE_WL = -1,
+	BE_WL = 0,
+	RT_WL = 1
+};
+
+enum wl_type_t {
+	ASYNC_WL = 0,
+	SYNC_NOIDLE_WL = 1,
+	SYNC_WL = 2
 };
 
 /*
@@ -124,7 +146,13 @@ struct cfq_data {
 	/*
 	 * rr list of queues with requests and the count of them
 	 */
-	struct cfq_rb_root service_tree;
+	struct cfq_rb_root service_trees[2][3];
+	struct cfq_rb_root service_tree_idle;
+
+	enum wl_prio_t serving_prio;
+	enum wl_type_t serving_type;
+	unsigned long workload_expires;
+	unsigned int async_starved;
 
 	/*
 	 * Each priority tree is sorted by next_request position.  These
@@ -134,14 +162,11 @@ struct cfq_data {
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
 	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
+	unsigned int busy_queues_avg[2];
 
-	int rq_in_driver;
+	int rq_in_driver[2];
 	int sync_flight;
+	int reads_delayed;
 
 	/*
 	 * queue-depth detection
@@ -178,6 +203,9 @@ struct cfq_data {
 	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
+	unsigned int cfq_target_latency;
+	unsigned int cfq_hist_divisor;
+	unsigned int cfq_async_penalty;
 
 	struct list_head cic_list;
 
@@ -187,11 +215,15 @@ struct cfq_data {
 	struct cfq_queue oom_cfqq;
 };
 
+static struct cfq_rb_root * service_tree_for(enum wl_prio_t prio, enum wl_type_t type,
+							  struct cfq_data *cfqd) {
+	return prio == IDLE_WL ? &cfqd->service_tree_idle :  &cfqd->service_trees[prio][type];
+}
+
 enum cfqq_state_flags {
 	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
 	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
 	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
-	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
 	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
@@ -218,7 +250,6 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)		\
 CFQ_CFQQ_FNS(on_rr);
 CFQ_CFQQ_FNS(wait_request);
 CFQ_CFQQ_FNS(must_dispatch);
-CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
 CFQ_CFQQ_FNS(idle_window);
@@ -233,12 +264,28 @@ CFQ_CFQQ_FNS(coop);
 #define cfq_log(cfqd, fmt, args...)	\
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
+#define CIC_SEEK_THR	1024
+#define CIC_SEEKY(cic)	((cic)->seek_mean > CIC_SEEK_THR)
+#define CFQQ_SEEKY(cfqq) (!cfqq->cic || CIC_SEEKY(cfqq->cic))
+
+static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd) {
+	return wl==IDLE_WL? cfqd->service_tree_idle.count :
+		cfqd->service_trees[wl][ASYNC_WL].count
+		+ cfqd->service_trees[wl][SYNC_NOIDLE_WL].count
+		+ cfqd->service_trees[wl][SYNC_WL].count;
+}
+
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
 				       struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
+static inline int rq_in_driver(struct cfq_data *cfqd)
+{
+	return cfqd->rq_in_driver[0] + cfqd->rq_in_driver[1];
+}
+
 static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 					    int is_sync)
 {
@@ -249,6 +296,7 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
 				struct cfq_queue *cfqq, int is_sync)
 {
 	cic->cfqq[!!is_sync] = cfqq;
+	cfqq->cic = cic;
 }
 
 /*
@@ -257,7 +305,7 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
  */
 static inline int cfq_bio_sync(struct bio *bio)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
+	if (bio_data_dir(bio) == READ || bio_rw_flagged(bio, BIO_RW_SYNCIO))
 		return 1;
 
 	return 0;
@@ -303,10 +351,33 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
 }
 
+static inline unsigned
+cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) {
+	unsigned min_q, max_q;
+	unsigned mult  = cfqd->cfq_hist_divisor - 1;
+	unsigned round = cfqd->cfq_hist_divisor / 2;
+	unsigned busy  = cfq_busy_queues_wl(rt, cfqd);
+	min_q = min(cfqd->busy_queues_avg[rt], busy);
+	max_q = max(cfqd->busy_queues_avg[rt], busy);
+	cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
+		cfqd->cfq_hist_divisor;
+	return cfqd->busy_queues_avg[rt];
+}
+
 static inline void
 cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+	unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1];
+	unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq));
+	unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
+
+	if (iq > process_thr) {
+		unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle
+			/ cfqd->cfq_slice[1];
+		slice = max(slice * process_thr / iq, min(slice, low_slice));
+	}
+
+	cfqq->slice_end = jiffies + slice;
 	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
 }
 
@@ -445,6 +516,7 @@ static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
 	if (root->left == n)
 		root->left = NULL;
 	rb_erase_init(n, &root->rb);
+	--root->count;
 }
 
 /*
@@ -485,46 +557,56 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
 }
 
 /*
- * The cfqd->service_tree holds all pending cfq_queue's that have
+ * The cfqd->service_trees holds all pending cfq_queue's that have
  * requests waiting to be processed. It is sorted in the order that
  * we will service the queues.
  */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
+static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	struct rb_node **p, *parent;
 	struct cfq_queue *__cfqq;
 	unsigned long rb_key;
+	struct cfq_rb_root *service_tree;
 	int left;
 
 	if (cfq_class_idle(cfqq)) {
 		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
+		service_tree = &cfqd->service_tree_idle;
+		parent = rb_last(&service_tree->rb);
 		if (parent && parent != &cfqq->rb_node) {
 			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 			rb_key += __cfqq->rb_key;
 		} else
 			rb_key += jiffies;
-	} else if (!add_front) {
+	} else {
+		enum wl_prio_t prio = cfq_class_rt(cfqq) ? RT_WL : BE_WL;
+		enum wl_type_t type = cfq_cfqq_sync(cfqq) ? SYNC_WL : ASYNC_WL;
+
 		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
 		rb_key += cfqq->slice_resid;
 		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
+
+		if (type == SYNC_WL && (CFQQ_SEEKY(cfqq) || !cfq_cfqq_idle_window(cfqq)))
+			type = SYNC_NOIDLE_WL;
+
+		service_tree = service_tree_for(prio, type, cfqd);
+	}
 
 	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
 		/*
 		 * same position, nothing more to do
 		 */
-		if (rb_key == cfqq->rb_key)
+		if (rb_key == cfqq->rb_key && cfqq->service_tree == service_tree)
 			return;
 
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+		cfqq->service_tree = NULL;
 	}
 
 	left = 1;
 	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
+	cfqq->service_tree = service_tree;
+	p = &service_tree->rb.rb_node;
 	while (*p) {
 		struct rb_node **n;
 
@@ -556,11 +638,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	}
 
 	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
+		service_tree->left = &cfqq->rb_node;
 
 	cfqq->rb_key = rb_key;
 	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
+	rb_insert_color(&cfqq->rb_node, &service_tree->rb);
+	service_tree->count++;
 }
 
 static struct cfq_queue *
@@ -633,7 +716,7 @@ static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * Resorting requires the cfqq to be on the RR list already.
 	 */
 	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
+		cfq_service_tree_add(cfqd, cfqq);
 		cfq_prio_tree_add(cfqd, cfqq);
 	}
 }
@@ -648,8 +731,6 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	BUG_ON(cfq_cfqq_on_rr(cfqq));
 	cfq_mark_cfqq_on_rr(cfqq);
 	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
 
 	cfq_resort_rr_list(cfqd, cfqq);
 }
@@ -664,8 +745,10 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));
 	cfq_clear_cfqq_on_rr(cfqq);
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+		cfqq->service_tree = NULL;
+	}
 	if (cfqq->p_root) {
 		rb_erase(&cfqq->p_node, cfqq->p_root);
 		cfqq->p_root = NULL;
@@ -673,8 +756,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 
 	BUG_ON(!cfqd->busy_queues);
 	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
 }
 
 /*
@@ -760,9 +841,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
+	cfqd->rq_in_driver[rq_is_sync(rq)]++;
 	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
+						rq_in_driver(cfqd));
 
 	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
 }
@@ -770,11 +851,12 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
+	const int sync = rq_is_sync(rq);
 
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
+	WARN_ON(!cfqd->rq_in_driver[sync]);
+	cfqd->rq_in_driver[sync]--;
 	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
+						rq_in_driver(cfqd));
 }
 
 static void cfq_remove_request(struct request *rq)
@@ -928,10 +1010,11 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
  */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
+	struct cfq_rb_root *service_tree = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd);
 
-	return cfq_rb_first(&cfqd->service_tree);
+	if (RB_EMPTY_ROOT(&service_tree->rb))
+		return NULL;
+	return cfq_rb_first(service_tree);
 }
 
 /*
@@ -959,9 +1042,6 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
 		return cfqd->last_position - blk_rq_pos(rq);
 }
 
-#define CIC_SEEK_THR	8 * 1024
-#define CIC_SEEKY(cic)	((cic)->seek_mean > CIC_SEEK_THR)
-
 static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq)
 {
 	struct cfq_io_context *cic = cfqd->active_cic;
@@ -1049,6 +1129,10 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 	if (cfq_cfqq_coop(cfqq))
 		return NULL;
 
+	/* we don't want to mix processes with different characteristics */
+	if (cfqq->service_tree != cur_cfqq->service_tree)
+		return NULL;
+
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
 	return cfqq;
@@ -1080,7 +1164,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	/*
 	 * still requests with the driver, don't idle
 	 */
-	if (cfqd->rq_in_driver)
+	if (rq_in_driver(cfqd))
 		return;
 
 	/*
@@ -1092,14 +1176,15 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 
 	cfq_mark_cfqq_wait_request(cfqq);
 
-	/*
-	 * we don't want to idle for seeks, but we do want to allow
-	 * fair distribution of slice time for a process doing back-to-back
-	 * seeks. so allow a little bit of time for him to submit a new rq
-	 */
-	sl = cfqd->cfq_slice_idle;
-	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
+	sl = min_t(unsigned, cfqd->cfq_slice_idle, cfqq->slice_end - jiffies);
+
+	/* very small idle if we are serving noidle trees, and there are more trees */
+	if (cfqd->serving_type == SYNC_NOIDLE_WL &&
+	    service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count > 0) {
+		if (blk_queue_nonrot(cfqd->queue))
+			return;
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
+	}
 
 	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
@@ -1115,6 +1200,12 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
 
+	if (!time_before(jiffies, rq->start_time + cfqd->cfq_target_latency / 2) && rq_data_dir(rq)==READ) {
+		cfqd->reads_delayed = max_t(int, cfqd->reads_delayed,
+					    (jiffies - rq->start_time) / (cfqd->cfq_target_latency / 2));
+	}
+
+	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
 	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
@@ -1160,6 +1251,16 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
 }
 
+enum wl_type_t cfq_choose_sync_async(struct cfq_data *cfqd, enum wl_prio_t prio) {
+	struct cfq_queue *id, *ni;
+	ni = cfq_rb_first(service_tree_for(prio, SYNC_NOIDLE_WL, cfqd));
+	id = cfq_rb_first(service_tree_for(prio, SYNC_WL, cfqd));
+	if (id && ni && id->rb_key < ni->rb_key)
+		return SYNC_WL;
+	if (!ni) return SYNC_WL;
+	return SYNC_NOIDLE_WL;
+}
+
 /*
  * Select a queue for service. If we have a current active queue,
  * check whether to continue servicing it, or retrieve and set a new one.
@@ -1179,20 +1280,6 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 		goto expire;
 
 	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
 	 * The active queue has requests and isn't expired, allow it to
 	 * dispatch.
 	 */
@@ -1214,15 +1301,68 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	 * flight or is idling for a new request, allow either of these
 	 * conditions to happen (or time out) before selecting a new queue.
 	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
+	if (timer_pending(&cfqd->idle_slice_timer) || 
 	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
 		cfqq = NULL;
 		goto keep_queue;
 	}
-
 expire:
 	cfq_slice_expired(cfqd, 0);
 new_queue:
+	if (!new_cfqq) {
+		enum wl_prio_t previous_prio = cfqd->serving_prio;
+
+		if (cfq_busy_queues_wl(RT_WL, cfqd))
+			cfqd->serving_prio = RT_WL;
+		else if (cfq_busy_queues_wl(BE_WL, cfqd))
+			cfqd->serving_prio = BE_WL;
+		else {
+			cfqd->serving_prio = IDLE_WL;
+			cfqd->workload_expires = jiffies + 1;
+			cfqd->reads_delayed = 0;
+		}
+
+		if (cfqd->serving_prio != IDLE_WL) {
+			int counts[]={
+				service_tree_for(cfqd->serving_prio, ASYNC_WL, cfqd)->count,
+				service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count,
+				service_tree_for(cfqd->serving_prio, SYNC_WL, cfqd)->count
+			};
+			int nonzero_counts= !!counts[0] + !!counts[1] + !!counts[2];
+
+			if (previous_prio != cfqd->serving_prio || (nonzero_counts == 1)) {
+				cfqd->serving_type = counts[1] ? SYNC_NOIDLE_WL : counts[2] ? SYNC_WL : ASYNC_WL;
+				cfqd->async_starved = 0;
+				cfqd->reads_delayed = 0;
+			} else {
+				if (!counts[cfqd->serving_type] || time_after(jiffies, cfqd->workload_expires)) {
+					if (cfqd->serving_type != ASYNC_WL && counts[ASYNC_WL] &&
+					    cfqd->async_starved++ > cfqd->cfq_async_penalty * (1 + cfqd->reads_delayed))
+						cfqd->serving_type = ASYNC_WL;
+					else 
+						cfqd->serving_type = cfq_choose_sync_async(cfqd, cfqd->serving_prio);
+				} else
+					goto same_wl;
+			}
+
+			{
+				unsigned slice = cfqd->cfq_target_latency;
+				slice = slice * counts[cfqd->serving_type] /
+					max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio],
+					      counts[SYNC_WL] + counts[SYNC_NOIDLE_WL] + counts[ASYNC_WL]);
+					    
+				if (cfqd->serving_type == ASYNC_WL)
+					slice = max(1U, (slice / (1 + cfqd->reads_delayed))
+						    * cfqd->cfq_slice[0] / cfqd->cfq_slice[1]);
+				else
+					slice = max(slice, 2U * max(1U, cfqd->cfq_slice_idle));
+
+				cfqd->workload_expires = jiffies + slice;
+				cfqd->async_starved *= (cfqd->serving_type != ASYNC_WL);
+			}
+		}
+	}
+ same_wl:
 	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
 keep_queue:
 	return cfqq;
@@ -1249,8 +1389,13 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 {
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
+	int i,j;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j])) != NULL)
+				dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
 	cfq_slice_expired(cfqd, 0);
@@ -1312,6 +1457,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		return 0;
 
 	/*
+	 * Drain async requests before we start sync IO
+	 */
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+		return 0;
+
+	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
@@ -1362,7 +1513,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		cfq_slice_expired(cfqd, 0);
 	}
 
-	cfq_log(cfqd, "dispatched a request");
+	cfq_log_cfqq(cfqd, cfqq, "dispatched a request");
 	return 1;
 }
 
@@ -2004,18 +2155,8 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (cfq_class_idle(cfqq))
 		return 1;
 
-	/*
-	 * if the new request is sync, but the currently running queue is
-	 * not, let the sync request have priority.
-	 */
-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
-		return 1;
-
-	/*
-	 * So both queues are sync. Let the new request get disk time if
-	 * it's a metadata request and the current queue is doing regular IO.
-	 */
-	if (rq_is_meta(rq) && !cfqq->meta_pending)
+	if (cfqd->serving_type == SYNC_NOIDLE_WL
+	    && new_cfqq->service_tree == cfqq->service_tree)
 		return 1;
 
 	/*
@@ -2046,13 +2187,9 @@ static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
 	cfq_slice_expired(cfqd, 1);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));
 
-	cfq_service_tree_add(cfqd, cfqq, 1);
+	cfq_service_tree_add(cfqd, cfqq);
 
 	cfqq->slice_end = 0;
 	cfq_mark_cfqq_slice_new(cfqq);
@@ -2130,11 +2267,11 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_update_hw_tag(struct cfq_data *cfqd)
 {
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
+	if (rq_in_driver(cfqd) > cfqd->rq_in_driver_peak)
+		cfqd->rq_in_driver_peak = rq_in_driver(cfqd);
 
 	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
+	    rq_in_driver(cfqd) <= CFQ_HW_QUEUE_MIN)
 		return;
 
 	if (cfqd->hw_tag_samples++ < 50)
@@ -2161,9 +2298,9 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	cfq_update_hw_tag(cfqd);
 
-	WARN_ON(!cfqd->rq_in_driver);
+	WARN_ON(!cfqd->rq_in_driver[sync]);
 	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
+	cfqd->rq_in_driver[sync]--;
 	cfqq->dispatched--;
 
 	if (cfq_cfqq_sync(cfqq))
@@ -2197,7 +2334,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 			cfq_arm_slice_timer(cfqd);
 	}
 
-	if (!cfqd->rq_in_driver)
+	if (!rq_in_driver(cfqd))
 		cfq_schedule_dispatch(cfqd);
 }
 
@@ -2229,8 +2366,7 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if (cfq_cfqq_wait_request(cfqq) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2317,7 +2453,6 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	}
 
 	cfqq->allocated[rw]++;
-	cfq_clear_cfqq_must_alloc(cfqq);
 	atomic_inc(&cfqq->ref);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
@@ -2451,13 +2586,16 @@ static void cfq_exit_queue(struct elevator_queue *e)
 static void *cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
-	int i;
+	int i,j;
 
 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			cfqd->service_trees[i][j] = CFQ_RB_ROOT;
+	cfqd->service_tree_idle = CFQ_RB_ROOT;
 
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
@@ -2494,6 +2632,9 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
+	cfqd->cfq_target_latency = cfq_target_latency;
+	cfqd->cfq_hist_divisor = cfq_hist_divisor;
+	cfqd->cfq_async_penalty = cfq_async_penalty;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2530,6 +2671,7 @@ fail:
 /*
  * sysfs parts below -->
  */
+
 static ssize_t
 cfq_var_show(unsigned int var, char *page)
 {
@@ -2563,6 +2705,9 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
+SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
+SHOW_FUNCTION(cfq_hist_divisor_show, cfqd->cfq_hist_divisor, 0);
+SHOW_FUNCTION(cfq_async_penalty_show, cfqd->cfq_async_penalty, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2594,6 +2739,11 @@ STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
+
+STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 1000, 1);
+STORE_FUNCTION(cfq_hist_divisor_store, &cfqd->cfq_hist_divisor, 1, 100, 0);
+STORE_FUNCTION(cfq_async_penalty_store, &cfqd->cfq_async_penalty, 1, UINT_MAX, 0);
+
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2609,6 +2759,9 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	CFQ_ATTR(target_latency),
+	CFQ_ATTR(hist_divisor),
+	CFQ_ATTR(async_penalty),
 	__ATTR_NULL
 };
 

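[Editor's note] The workload-type decision added by cfq_choose_sync_async() in the patch above is compact enough to model in isolation. The sketch below is an illustrative stand-alone version, not kernel code: `struct queue`, `choose_sync_async` and the pointer arguments are hypothetical stand-ins for `cfq_queue` and the `cfq_rb_first()` results, with NULL modelling an empty service tree. It shows the rule the patch implements: serve the sequential sync tree only when its head queue has an earlier rb_key (deadline) than the head of the no-idle tree, or when the no-idle tree is empty; otherwise serve the no-idle tree.

```c
#include <assert.h>
#include <stddef.h>

enum wl_type_t { ASYNC_WL, SYNC_NOIDLE_WL, SYNC_WL };

/* Hypothetical stand-in for cfq_queue: rb_key is the queue's
 * position (service deadline) in its service tree. */
struct queue { unsigned long rb_key; };

/* Mirrors the decision in the patch's cfq_choose_sync_async():
 * 'id' is the head of the idling (sequential sync) tree, 'ni' the
 * head of the no-idle tree; NULL models an empty tree. */
enum wl_type_t choose_sync_async(const struct queue *id, const struct queue *ni)
{
	if (id && ni && id->rb_key < ni->rb_key)
		return SYNC_WL;		/* sequential head has the earlier deadline */
	if (!ni)
		return SYNC_WL;		/* no-idle tree empty: serve the sync tree */
	return SYNC_NOIDLE_WL;
}
```

Note that, as in the patch, an empty sync tree with a non-empty no-idle tree yields SYNC_NOIDLE_WL, and two empty trees fall through to SYNC_WL; the caller only reaches this path when some queue in the serving priority class is busy.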
^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]       ` <4e5e476b0909271000u69d79346s27cccad219e49902-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2009-09-28 14:56         ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-28 14:56 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
> Hi Vivek,
> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> >> Vivek Goyal wrote:
> >> > Notes:
> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
> >> >   Bring down its throughput and bump up latencies significantly.
> >>
> >>
> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> >> too.
> >>
> >> I'm basing this assumption on the observations I made on both OpenSuse
> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> >> titled: "Poor desktop responsiveness with background I/O-operations" of
> >> 2009-09-20.
> >> (Message ID: 4AB59CBB.8090907-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org)
> >>
> >>
> >> Thus, I'm posting this to show that your work is greatly appreciated,
> >> given the rather disappointing status quo of Linux's fairness when it
> >> comes to disk IO time.
> >>
> >> I hope that your efforts lead to a change in performance of current
> >> userland applications, the sooner, the better.
> >>
> > [Please don't remove people from original CC list. I am putting them back.]
> >
> > Hi Ulrich,
> >
> > I quickly went through that mail thread and I tried the following on my
> > desktop.
> >
> > ##########################################
> > dd if=/home/vgoyal/4G-file of=/dev/null &
> > sleep 5
> > time firefox
> > # close firefox once gui pops up.
> > ##########################################
> >
> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
> > following.
> >
> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
> >
> > (Results do vary across runs, especially if system is booted fresh. Don't
> >  know why...).
> >
> >
> > Then I tried putting both the applications in separate groups and assigning
> > them weights of 200 each.
> >
> > ##########################################
> > dd if=/home/vgoyal/4G-file of=/dev/null &
> > echo $! > /cgroup/io/test1/tasks
> > sleep 5
> > echo $$ > /cgroup/io/test2/tasks
> > time firefox
> > # close firefox once gui pops up.
> > ##########################################
> >
> > Now firefox pops up in 27 seconds. So it cut down the time by 2/3.
> >
> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
> >
> > Notice that throughput of dd also improved.
> >
> > I ran the block trace and noticed in many a cases firefox threads
> > immediately preempted the "dd". Probably because it was a file system
> > request. So in this case latency will arise from seek time.
> >
> > In some other cases, threads had to wait for up to 100ms because dd was
> > not preempted. In this case latency will arise both from waiting on queue
> > as well as seek time.
> 
> I think cfq should already be doing something similar, i.e. giving
> 100ms slices to firefox, that alternate with dd, unless:
> * firefox is too seeky (in this case, the idle window will be too small)
> * firefox has too much think time.
> 

Hi Corrado,

"firefox" is the shell script that sets up the environment and launches the
browser. It seems to be a group of threads. Some of them run in parallel
and some seem to run one after the other (once the previous
process or thread has finished).


> To rule out the first case, what happens if you run the test with your
> "fairness for seeky processes" patch?

I applied that patch and it helps a lot.

http://lwn.net/Articles/341032/

With above patchset applied, and fairness=1, firefox pops up in 27-28
seconds.

So it looks like not disabling the idle window for seeky processes on
hardware supporting command queuing helps in this particular case.

Thanks
Vivek


 
> To rule out the second case, what happens if you increase the slice_idle?
> 
> Thanks,
> Corrado
> 
> >
> > With the cgroup thing, we will run a 100ms slice for the group in which firefox
> > is being launched and then give a 100ms uninterrupted time slice to dd. So
> > it should cut down on the number of seeks happening, and that's why we probably
> > see this improvement.
> >
> > So grouping can help in such cases. Maybe you can move your X session into
> > one group and launch the big IO in another group. Most likely you should
> > have a better desktop experience without compromising on dd thread output.
> 
> > Thanks
> > Vivek
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
> 
> 
> 
> -- 
> __________________________________________________________________________
> 
> dott. Corrado Zoccolo                          mailto:czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]         ` <4ABCDBFF.1020203-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-28  7:38           ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-09-28  7:38 UTC (permalink / raw)
  To: riel-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Rik,

Rik van Riel <riel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Ryo Tsuruta wrote:
> 
> > Because dm-ioband provides fairness in terms of how many IO requests
> > are issued or how many bytes are transferred, so this behaviour is to
> > be expected. Do you think fairness in terms of IO requests and size is
> > not fair?
> 
> When there are two workloads competing for the same
> resources, I would expect each of the workloads to
> run at about 50% of the speed at which it would run
> on an uncontended system.
> 
> Having one of the workloads run at 95% of the
> uncontended speed and the other workload at 5%
> is "not fair" (to put it diplomatically).

As I wrote in the mail to Vivek, I think that providing multiple
policies, on a per disk time basis, on a per iosize basis, maximum
rate limiting or etc would be good for users.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]         ` <20090925143337.GA15007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-28  7:30           ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-09-28  7:30 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,

Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > Because dm-ioband provides faireness in terms of how many IO requests
> > are issued or how many bytes are transferred, so this behaviour is to
> > be expected. Do you think fairness in terms of IO requests and size is
> > not fair?
> > 
> 
> Hi Ryo,
> 
> Fairness in terms of size of IO or number of requests is probably not the
> best thing to do on rotational media where seek latencies are significant.
> 
> It probably should work just well on media with very low seek latencies
> like SSD.
> 
> So on rotational media, either you will not provide fairness to random 
> readers because they are too slow or you will choke the sequential readers
> in other group and also bring down the overall disk throughput.
> 
> If you don't decide to choke/throttle sequential reader group for the sake
> of random reader in other group then you will not have a good control
> on random reader latencies. Because now IO scheduler sees the IO from both
> sequential reader as well as random reader and sequential readers have not
> been throttled. So the dispatch pattern/time slices will again look like..
> 
> 	SR1 SR2 SR3 SR4 SR5 RR.....
> 
> 	instead  of
> 
> 	SR1 RR SR2 RR SR3 RR SR4 RR ....
>  
> SR --> sequential reader,  RR --> random reader

Thank you for elaborating. However, I think that fairness in terms of
disk time has a similar problem. Below is a benchmark result of
randread vs. seqread that I posted before; the random readers and
sequential readers ran in separate groups whose weights were assigned
equally.

                   Throughput [KiB/s]
             io-controller  dm-ioband
randread         161          314
seqread         9556          631

I know that dm-ioband needs improvement on the seqread throughput, but
I don't think that io-controller is quite fair either: even though the
disk time given to each group is equal, randread cannot get more
bandwidth. So I think that this is how users think about fairness, and
it would be a good thing to provide multiple policies of bandwidth
control for users.
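The trade-off between the two fairness notions discussed in this thread can be made concrete with a back-of-the-envelope model. All numbers below (100 MB/s streaming rate, 8 ms average seek, 4 KiB random reads) are illustrative assumptions, not measurements from either controller:

```python
# Illustrative model: disk-time fairness vs. bandwidth fairness on a
# rotational disk shared by a sequential and a random reader.
SEQ_RATE_MBS = 100.0   # assumed sequential streaming rate
SEEK_MS = 8.0          # assumed average seek + rotational latency
RAND_IO_KB = 4.0       # assumed random read request size

def rand_rate_mbs():
    # One 4 KiB read per ~8 ms of disk time => very low bandwidth.
    return (RAND_IO_KB / 1024.0) / (SEEK_MS / 1000.0)

def time_fair(share_seq=0.5, share_rand=0.5):
    # Disk-time fairness: each group gets half the wall-clock disk time.
    return SEQ_RATE_MBS * share_seq, rand_rate_mbs() * share_rand

def bandwidth_fair():
    # Bandwidth fairness: split disk time so both groups move equal MB/s
    # (solve r_s * t_seq = r_r * t_rand with t_seq + t_rand = 1).
    r_s, r_r = SEQ_RATE_MBS, rand_rate_mbs()
    t_seq = r_r / (r_s + r_r)
    return r_s * t_seq, r_r * (1.0 - t_seq)
```

Under time fairness the sequential group keeps ~50 MB/s while the random group gets well under 1 MB/s; under bandwidth fairness both groups converge to roughly the random reader's rate, collapsing aggregate throughput. That is the same shape as the io-controller vs. dm-ioband table above.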

> > The write-starve-reads on dm-ioband, that you pointed out before, was
> > not caused by FIFO release, it was caused by IO flow control in
> > dm-ioband. When I turned off the flow control, then the read
> > throughput was quite improved.
> 
> What was flow control doing?

dm-ioband imposes a limit on each IO group. When the number of IO
requests backlogged in a group exceeds the limit, processes that are
going to issue IO requests to the group are put to sleep until all the
backlogged requests are flushed out.
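The flow-control behaviour described here can be sketched as a small user-space model (a hypothetical simplification, not the dm-ioband implementation):

```python
from collections import deque

class GroupFlowControl:
    """Sketch of dm-ioband-style flow control: once a group's backlog
    exceeds `limit`, submitters block until the backlog fully drains."""
    def __init__(self, limit):
        self.limit = limit
        self.backlog = deque()
        self.blocked = False

    def submit(self, io):
        if self.blocked:
            return False            # the caller would sleep here
        self.backlog.append(io)
        if len(self.backlog) > self.limit:
            self.blocked = True     # stop accepting new IO
        return True

    def dispatch_one(self):
        if self.backlog:
            self.backlog.popleft()
        if not self.backlog:
            self.blocked = False    # wake sleepers only once fully flushed
```

Note the hysteresis: submitters stay blocked until the backlog is completely empty, not merely back under the limit, which is exactly why turning the mechanism off changed the read throughput so much.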

> > Now I'm considering separating dm-ioband's internal queue into sync
> > and async and giving a certain priority of dispatch to async IOs.
> 
> Even if you maintain separate queues for sync and async, in what ratio will
> you dispatch reads and writes to underlying layer once fresh tokens become
> available to the group and you decide to unthrottle the group.

Now I'm thinking that dispatch follows the requested order, but when
the number of in-flight sync IOs exceeds io_limit (io_limit is
calculated from the nr_requests of the underlying block device),
dm-ioband dispatches only async IOs until the number of in-flight sync
IOs drops below io_limit, and vice versa. At least it could solve the
write-starves-reads issue which you pointed out.
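The proposed sync/async dispatch policy can be sketched like this (hypothetical model of the idea, not dm-ioband code):

```python
class SyncAsyncDispatcher:
    """Dispatch in requested order, but once in-flight IOs of one kind
    reach io_limit, skip that kind until some of them complete."""
    def __init__(self, io_limit):
        self.io_limit = io_limit
        self.inflight = {"sync": 0, "async": 0}
        self.queue = []  # list of (kind, io) in requested order

    def pick_next(self):
        for kind, io in self.queue:
            if self.inflight[kind] < self.io_limit:
                self.queue.remove((kind, io))
                self.inflight[kind] += 1
                return kind, io
        return None  # every eligible kind is at its limit

    def complete(self, kind):
        self.inflight[kind] -= 1
```

With this policy a long run of queued sync IOs can no longer monopolise the dispatch stream: once sync hits io_limit, queued async IOs get through, which is the write-starves-reads fix being discussed.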
 
> Whatever policy you adopt for read and write dispatch, it might not match
> with policy of underlying IO scheduler because every IO scheduler seems to
> have its own way of determining how reads and writes should be dispatched.

I think that this is a matter of user choice: whether a user would
like to give priority to bandwidth fairness or to the IO scheduler's
policy.

> Now somebody might start complaining that my job inside the group is not
> getting same reader/writer ratio as it was getting outside the group.
> 
> Thanks
> Vivek

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]               ` <1254110648.7683.3.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-09-28  5:55                 ` Mike Galbraith
  2009-09-28 17:48                 ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-28  5:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

P.S.

On Mon, 2009-09-28 at 06:04 +0200, Mike Galbraith wrote:

> - = virgin tip v2.6.31-10215-ga3c9602
> + = with patchlet
>                                                             Avg
> dd pre         67.4     70.9     65.4     68.9     66.2     67.7-
>                65.9     68.5     69.8     65.2     65.8     67.0-     Avg
>                70.4     70.3     65.1     66.4     70.1     68.4-     67.7-
>                73.1     64.6     65.3     65.3     64.9     66.6+     65.6+     .968
>                63.8     67.9     65.2     65.1     64.4     65.2+
>                64.9     66.3     64.1     65.2     64.8     65.0+
> perf stat      8.66    16.29     9.65    14.88     9.45     11.7-
>               15.36     9.71    15.47    10.44    12.93     12.7-
>               10.55    15.11    10.22    15.35    10.32     12.3-     12.2-
>                9.87     7.53    10.62     7.51     9.95      9.0+      9.1+     .745
>                7.73    10.12     8.19    11.87     8.07      9.1+
>               11.04     7.62    10.14     8.13    10.23      9.4+
> dd post        63.4     60.5     66.7     64.5     67.3     64.4-
>                64.4     66.8     64.3     61.5     62.0     63.8-
>                63.8     64.9     66.2     65.6     66.9     65.4-     64.5-
>                60.9     63.4     60.2     63.4     65.5     62.6+     61.8+     .958
>                63.3     59.9     61.9     62.7     61.2     61.8+
>                60.1     63.7     59.5     61.5     60.6     61.0+

Deadline and noop fsc^W are less than wonderful choices for this load.

perf stat     12.82     7.19     8.49     5.76      9.32   anticipatory
              16.24   175.82   154.38   228.97    147.16   noop
              43.23    57.39    96.13   148.25    180.09   deadline
              28.65   167.40   195.95   183.69    178.61   deadline v2.6.27.35

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]             ` <1254075359.7354.66.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-09-28  4:04               ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-28  4:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Sun, 2009-09-27 at 20:16 +0200, Mike Galbraith wrote:
> On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:

> I'll give it a shot first thing in the A.M.

> > diff --git a/block/elevator.c b/block/elevator.c
> > index 1975b61..d00a72b 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
> >  	 * See if our hash lookup can find a potential backmerge.
> >  	 */
> >  	__rq = elv_rqhash_find(q, bio->bi_sector);
> > -	if (__rq && elv_rq_merge_ok(__rq, bio)) {
> > -		*req = __rq;
> > -		return ELEVATOR_BACK_MERGE;
> > +	if (__rq) {
> > +		/*
> > +		 * If requests are queued behind this one, disallow merge. This
> > +		 * prevents streaming IO from continually passing new IO.
> > +		 */
> > +		if (elv_latter_request(q, __rq))
> > +			return ELEVATOR_NO_MERGE;
> > +		if (elv_rq_merge_ok(__rq, bio)) {
> > +			*req = __rq;
> > +			return ELEVATOR_BACK_MERGE;
> > +		}
> >  	}
> >  
> >  	if (e->ops->elevator_merge_fn)

- = virgin tip v2.6.31-10215-ga3c9602
+ = with patchlet
                                                            Avg
dd pre         67.4     70.9     65.4     68.9     66.2     67.7-
               65.9     68.5     69.8     65.2     65.8     67.0-     Avg
               70.4     70.3     65.1     66.4     70.1     68.4-     67.7-
               73.1     64.6     65.3     65.3     64.9     66.6+     65.6+     .968
               63.8     67.9     65.2     65.1     64.4     65.2+
               64.9     66.3     64.1     65.2     64.8     65.0+
perf stat      8.66    16.29     9.65    14.88     9.45     11.7-
              15.36     9.71    15.47    10.44    12.93     12.7-
              10.55    15.11    10.22    15.35    10.32     12.3-     12.2-
               9.87     7.53    10.62     7.51     9.95      9.0+      9.1+     .745
               7.73    10.12     8.19    11.87     8.07      9.1+
              11.04     7.62    10.14     8.13    10.23      9.4+
dd post        63.4     60.5     66.7     64.5     67.3     64.4-
               64.4     66.8     64.3     61.5     62.0     63.8-
               63.8     64.9     66.2     65.6     66.9     65.4-     64.5-
               60.9     63.4     60.2     63.4     65.5     62.6+     61.8+     .958
               63.3     59.9     61.9     62.7     61.2     61.8+
               60.1     63.7     59.5     61.5     60.6     61.0+

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]           ` <20090927164235.GA23126-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-09-27 18:15             ` Mike Galbraith
  2009-09-30 19:58             ` Mike Galbraith
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-27 18:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:
> On Sun, Sep 27 2009, Mike Galbraith wrote:
> > My dd vs load non-cached binary woes seem to be coming from backmerge.
> > 
> > #if 0 /*MIKEDIDIT sand in gearbox?*/
> >         /*
> >          * See if our hash lookup can find a potential backmerge.
> >          */
> >         __rq = elv_rqhash_find(q, bio->bi_sector);
> >         if (__rq && elv_rq_merge_ok(__rq, bio)) {
> >                 *req = __rq;
> >                 return ELEVATOR_BACK_MERGE;
> >         }
> > #endif
> 
> It's a given that not merging will provide better latency.

Yeah, absolutely everything I've diddled that reduces the size of queued
data improves the situation, which makes perfect sense.  This one was a
bit unexpected.  Front merges didn't hurt at all, back merges did, and
lots.  After diddling the code a bit, I had the "well _duh_" moment.

>  We can't
> disable that or performance will suffer A LOT on some systems. There are
> ways to make it better, though. One would be to make the max request
> size smaller, but that would also hurt for streamed workloads. Can you
> try whether the below patch makes a difference? It will basically
> disallow merges to a request that isn't the last one.

That's what all the looking I've done ends up at.  Either you let the
disk be all it can be, and you pay in latency, or you don't, and you pay
in throughput.

> below wont work well for two (or more) streamed cases. I'll think a bit
> about that.

Cool, think away.  I've been eyeballing and pondering how to know when
latency is going to become paramount.  Absolutely nothing is happening,
even for "it's my root".

> Note this is totally untested!

I'll give it a shot first thing in the A.M.

Note: I tested my stable of kernels today (22->), and we are better off
dd vs read today than ever in this time span at least.

(i can't recall ever seeing a system where beating snot outta root
didn't hurt really bad... would be very nice though;)

> diff --git a/block/elevator.c b/block/elevator.c
> index 1975b61..d00a72b 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
>  	 * See if our hash lookup can find a potential backmerge.
>  	 */
>  	__rq = elv_rqhash_find(q, bio->bi_sector);
> -	if (__rq && elv_rq_merge_ok(__rq, bio)) {
> -		*req = __rq;
> -		return ELEVATOR_BACK_MERGE;
> +	if (__rq) {
> +		/*
> +		 * If requests are queued behind this one, disallow merge. This
> +		 * prevents streaming IO from continually passing new IO.
> +		 */
> +		if (elv_latter_request(q, __rq))
> +			return ELEVATOR_NO_MERGE;
> +		if (elv_rq_merge_ok(__rq, bio)) {
> +			*req = __rq;
> +			return ELEVATOR_BACK_MERGE;
> +		}
>  	}
>  
>  	if (e->ops->elevator_merge_fn)
> 
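The guard the patch above adds can be modelled in a few lines of user-space Python (a hypothetical simplification of elv_merge(), not kernel code):

```python
# Toy model of the patch: a back merge into request __rq is only
# allowed when __rq is the last request in dispatch order, so a
# streaming writer cannot keep growing a request that other IO is
# already queued behind.
def try_back_merge(queue, rq_index, can_merge=True):
    """queue: requests in dispatch order; rq_index: the hash-lookup
    hit; can_merge: stands in for elv_rq_merge_ok()."""
    has_latter = rq_index < len(queue) - 1   # elv_latter_request()
    if has_latter:
        return "NO_MERGE"                    # someone is waiting behind us
    return "BACK_MERGE" if can_merge else "NO_MERGE"
```

This is why the patch trades a little streaming throughput for latency: merges still happen at the queue tail, but never past IO that has already been queued behind the candidate request.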

^ permalink raw reply	[flat|nested] 349+ messages in thread


* Re: IO scheduler based IO controller V10
       [not found]     ` <20090925202636.GC15007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-26 14:51       ` Mike Galbraith
@ 2009-09-27 17:00       ` Corrado Zoccolo
  1 sibling, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-09-27 17:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,
On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> Vivek Goyal wrote:
>> > Notes:
>> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >   Bring down its throughput and bump up latencies significantly.
>>
>>
>> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> too.
>>
>> I'm basing this assumption on the observations I made on both OpenSuse
>> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> titled: "Poor desktop responsiveness with background I/O-operations" of
>> 2009-09-20.
>> (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
>>
>>
>> Thus, I'm posting this to show that your work is greatly appreciated,
>> given the rather disappointing status quo of Linux's fairness when it
>> comes to disk IO time.
>>
>> I hope that your efforts lead to a change in performance of current
>> userland applications, the sooner, the better.
>>
> [Please don't remove people from original CC list. I am putting them back.]
>
> Hi Ulrich,
>
> I quickly went through that mail thread and tried the following on my
> desktop.
>
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> sleep 5
> time firefox
> # close firefox once gui pops up.
> ##########################################
>
> It was taking close to 1 minute 30 seconds to launch firefox and dd got the
> following.
>
> 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>
> (Results do vary across runs, especially if system is booted fresh. Don't
>  know why...).
>
>
> Then I tried putting both the applications in separate groups and assigning
> them weights of 200 each.
>
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> echo $! > /cgroup/io/test1/tasks
> sleep 5
> echo $$ > /cgroup/io/test2/tasks
> time firefox
> # close firefox once gui pops up.
> ##########################################
>
> Now firefox pops up in 27 seconds. So it cut down the time by 2/3.
>
> 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>
> Notice that throughput of dd also improved.
>
> I ran the block trace and noticed that in many cases firefox threads
> immediately preempted the "dd". Probably because it was a file system
> request. So in this case latency will arise from seek time.
>
> In some other cases, threads had to wait for up to 100ms because dd was
> not preempted. In this case latency will arise both from waiting on queue
> as well as seek time.
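
As a quick arithmetic check on the figures quoted above (launch time
roughly 90s -> 27s, dd throughput 42.7 -> 50.8 MB/s), a small Python
sketch by the editor (the numbers are taken from the mail, nothing else
is assumed):

```python
# Sanity-check the reported improvement: firefox launch time dropped
# from ~90s to 27s, and dd throughput rose from 42.7 to 50.8 MB/s.

launch_before, launch_after = 90.0, 27.0   # seconds, from the mail
tp_before, tp_after = 42.7, 50.8           # MB/s, reported by dd

time_cut = 1 - launch_after / launch_before
tp_gain = tp_after / tp_before - 1

print(f"launch time cut by {time_cut:.0%}")   # ~70%, i.e. roughly 2/3
print(f"dd throughput up by {tp_gain:.1%}")   # ~19%
```

So both the latency-sensitive task and the streamer gained from the
grouping, consistent with the seek-reduction explanation given below.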

I think cfq should already be doing something similar, i.e. giving
firefox 100ms slices that alternate with dd's, unless:
* firefox is too seeky (in this case, the idle window will be too small)
* firefox has too much think time.

To rule out the first case, what happens if you run the test with your
"fairness for seeky processes" patch?
To rule out the second case, what happens if you increase the slice_idle?

Thanks,
Corrado

>
> With the cgroup thing, we will run a 100ms slice for the group in which firefox
> is being launched and then give 100ms uninterrupted time slice to dd. So
> it should cut down on number of seeks happening and that's why we probably
> see this improvement.
>
> So grouping can help in such cases. Maybe you can move your X session into
> one group and launch the big IO in the other group. Most likely you should
> have a better desktop experience without compromising on dd thread output.

> Thanks
> Vivek
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
_______________________________________________
Containers mailing list
Containers@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]         ` <1254034500.7933.6.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-09-27 16:42           ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-09-27 16:42 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Sun, Sep 27 2009, Mike Galbraith wrote:
> My dd vs load non-cached binary woes seem to be coming from backmerge.
> 
> #if 0 /*MIKEDIDIT sand in gearbox?*/
>         /*
>          * See if our hash lookup can find a potential backmerge.
>          */
>         __rq = elv_rqhash_find(q, bio->bi_sector);
>         if (__rq && elv_rq_merge_ok(__rq, bio)) {
>                 *req = __rq;
>                 return ELEVATOR_BACK_MERGE;
>         }
> #endif

It's a given that not merging will provide better latency. We can't
disable that or performance will suffer A LOT on some systems. There are
ways to make it better, though. One would be to make the max request
size smaller, but that would also hurt for streamed workloads. Can you
try whether the below patch makes a difference? It will basically
disallow merges to a request that isn't the last one.

We should probably make the merging logic a bit more clever, since the
below won't work well for two (or more) streamed cases. I'll think a bit
about that.

Note this is totally untested!

diff --git a/block/elevator.c b/block/elevator.c
index 1975b61..d00a72b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	 * See if our hash lookup can find a potential backmerge.
 	 */
 	__rq = elv_rqhash_find(q, bio->bi_sector);
-	if (__rq && elv_rq_merge_ok(__rq, bio)) {
-		*req = __rq;
-		return ELEVATOR_BACK_MERGE;
+	if (__rq) {
+		/*
+		 * If requests are queued behind this one, disallow merge. This
+		 * prevents streaming IO from continually passing new IO.
+		 */
+		if (elv_latter_request(q, __rq))
+			return ELEVATOR_NO_MERGE;
+		if (elv_rq_merge_ok(__rq, bio)) {
+			*req = __rq;
+			return ELEVATOR_BACK_MERGE;
+		}
 	}
 
 	if (e->ops->elevator_merge_fn)
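
The policy in the patch above fits in a few lines; here is a hypothetical
Python model from the editor (the function name, the queue representation,
and the sector-contiguity check are inventions for illustration, not the
kernel code): a bio may be back-merged into the request the hash lookup
finds only if that request is the last one queued.

```python
# Illustrative model (not kernel code) of the back-merge restriction in
# the patch above: disallow the merge when a request is queued behind
# the merge candidate, so streaming IO cannot keep passing IO that is
# already waiting behind it.

NO_MERGE, BACK_MERGE = "no_merge", "back_merge"

def elv_merge(queue, bio_sector):
    """queue: list of (start_sector, nr_sectors) in dispatch order."""
    for i, (start, nr) in enumerate(queue):
        if start + nr == bio_sector:       # "hash lookup" hit: bio is contiguous
            if i != len(queue) - 1:        # a request is queued behind this one
                return NO_MERGE
            return BACK_MERGE
    return NO_MERGE

# The streamer's request at sector 0 has the reader's request behind it,
# so its next contiguous bio is refused; the truly last request may grow.
q = [(0, 8), (1000, 8)]
print(elv_merge(q, 8))      # no_merge
print(elv_merge(q, 1008))   # back_merge
```

This also makes Jens's caveat visible: with two interleaved streams, each
stream's candidate is rarely the last request, so both lose their merges.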

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]       ` <1253976676.7005.40.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-09-27  6:55         ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-27  6:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

My dd vs load non-cached binary woes seem to be coming from backmerge.

#if 0 /*MIKEDIDIT sand in gearbox?*/
        /*
         * See if our hash lookup can find a potential backmerge.
         */
        __rq = elv_rqhash_find(q, bio->bi_sector);
        if (__rq && elv_rq_merge_ok(__rq, bio)) {
                *req = __rq;
                return ELEVATOR_BACK_MERGE;
        }
#endif

- = stock = 0
+ = /sys/block/sdb/queue/nomerges = 1
x = backmerge disabled

quantum = 1                                                  Avg
dd pre         58.4     52.5     56.1     61.6     52.3     56.1-  MB/s   virgin/foo
               59.6     54.4     53.0     56.1     58.6     56.3+           1.003
               53.8     56.6     54.7     50.7     59.3     55.0x            .980
perf stat      2.87     0.91     1.64     1.41     0.90      1.5-  Sec
               2.61     1.14     1.45     1.43     1.47      1.6+           1.066
               1.07     1.19     1.20     1.24     1.37      1.2x            .800
dd post        56.6     61.0     66.3     64.7     60.9     61.9-
               54.0     59.3     61.1     58.3     58.9     58.3+            .941
               54.3     60.2     59.6     60.6     60.3     59.0x            .953

quantum = 2
dd pre         59.7     62.4     58.9     65.3     60.3     61.3-
               49.4     51.9     58.7     49.3     52.4     52.3+            .853
               58.3     52.8     53.1     50.4     59.9     54.9x            .895
perf stat      5.81     6.09     6.24    10.13     6.21      6.8-
               2.48     2.10     3.23     2.29     2.31      2.4+            .352
               2.09     2.73     1.72     1.96     1.83      2.0x            .294
dd post        64.0     62.6     64.2     60.4     61.1     62.4-
               52.9     56.2     49.6     51.3     51.2     52.2+            .836
               54.7     60.9     56.0     54.0     55.4     56.2x            .900

quantum = 3
dd pre         65.5     57.7     54.5     51.1     56.3     57.0-
               58.1     53.9     52.2     58.2     51.8     54.8+            .961
               60.5     56.5     56.7     55.3     54.6     56.7x            .994
perf stat     14.01    13.71     8.35     5.35     8.57      9.9-
               1.84     2.30     2.14     2.10     2.45      2.1+            .212
               2.12     1.63     2.54     2.23     2.29      2.1x            .212
dd post        59.2     49.1     58.8     62.3     62.1     58.3-
               59.8     53.2     55.2     50.9     53.7     54.5+            .934
               56.1     61.9     51.9     54.3     53.1     55.4x            .950

quantum = 4
dd pre         57.2     52.1     56.8     55.2     61.6     56.5-
               48.7     55.4     51.3     49.7     54.5     51.9+            .918
               55.8     54.5     50.3     56.4     49.3     53.2x            .941
perf stat     11.98     1.61     9.63    16.21    11.13     10.1-
               2.29     1.94     2.68     2.46     2.45      2.3+            .227
               3.01     1.84     2.11     2.27     2.30      2.3x            .227
dd post        57.2     52.6     62.2     49.3     50.2     54.3-
               50.1     54.5     58.4     54.1     49.0     53.2+            .979
               52.9     53.2     50.6     53.2     50.5     52.0x            .957
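
The Avg and ratio columns above can be recomputed from the raw runs; the
editor's sketch below does so for the quantum = 1 "perf stat" rows (the
table appears to truncate rather than round in places, so the last digit
of a ratio can differ by one):

```python
# Recompute the Avg and ratio columns for the quantum = 1 "perf stat"
# rows: stock, nomerges=1, and backmerge-disabled runs, in seconds.

stock    = [2.87, 0.91, 1.64, 1.41, 0.90]   # '-' rows
nomerges = [2.61, 1.14, 1.45, 1.43, 1.47]   # '+' rows
noback   = [1.07, 1.19, 1.20, 1.24, 1.37]   # 'x' rows

def avg(xs):
    return round(sum(xs) / len(xs), 1)

base = avg(stock)
for name, xs in [("stock", stock), ("nomerges", nomerges), ("no backmerge", noback)]:
    a = avg(xs)
    print(f"{name:13s} avg={a:.1f}  ratio={a / base:.3f}")
```

Disabling backmerge cuts the average load time to 0.8x of stock here,
matching the .800 in the table.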

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]     ` <20090925202636.GC15007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-26 14:51       ` Mike Galbraith
  2009-09-27 17:00       ` Corrado Zoccolo
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-26 14:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, 2009-09-25 at 16:26 -0400, Vivek Goyal wrote:
> On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> > Vivek Goyal wrote:
> > > Notes:
> > > - With vanilla CFQ, random writers can overwhelm a random reader.
> > >   Bring down its throughput and bump up latencies significantly.
> > 
> > 
> > IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> > too.
> > 
> > I'm basing this assumption on the observations I made on both OpenSuse
> > 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> > titled: "Poor desktop responsiveness with background I/O-operations" of
> > 2009-09-20.
> > (Message ID: 4AB59CBB.8090907-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org)
> > 
> > 
> > Thus, I'm posting this to show that your work is greatly appreciated,
> > given the rather disappointing status quo of Linux's fairness when it
> > comes to disk IO time.
> > 
> > I hope that your efforts lead to a change in performance of current
> > userland applications, the sooner, the better.
> > 
> [Please don't remove people from original CC list. I am putting them back.]
> 
> Hi Ulrich,
> 
> I quickly went through that mail thread and tried the following on my
> desktop.
> 
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> sleep 5
> time firefox
> # close firefox once gui pops up.
> ##########################################
> 
> It was taking close to 1 minute 30 seconds to launch firefox and dd got the
> following.
> 
> 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
> 
> (Results do vary across runs, especially if system is booted fresh. Don't
>  know why...).
> 
> 
> Then I tried putting both the applications in separate groups and assigning
> them weights of 200 each.
> 
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> echo $! > /cgroup/io/test1/tasks
> sleep 5
> echo $$ > /cgroup/io/test2/tasks
> time firefox
> # close firefox once gui pops up.
> ##########################################
> 
> Now firefox pops up in 27 seconds. So it cut down the time by 2/3.
> 
> 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
> 
> Notice that throughput of dd also improved.
> 
> I ran the block trace and noticed that in many cases firefox threads
> immediately preempted the "dd". Probably because it was a file system
> request. So in this case latency will arise from seek time.
> 
> In some other cases, threads had to wait for up to 100ms because dd was
> not preempted. In this case latency will arise both from waiting on queue
> as well as seek time.

Hm, with tip, I see ~10ms max wakeup latency running scriptlet below.

> With the cgroup thing, we will run a 100ms slice for the group in which firefox
> is being launched and then give 100ms uninterrupted time slice to dd. So
> it should cut down on number of seeks happening and that's why we probably
> see this improvement.

I'm not testing with group IO/CPU, but my numbers kinda agree that it's
seek latency that's THE killer.  What the numbers compiled from the
cheezy script below _seem_ to be telling me is that the default setting
of CFQ quantum is allowing too many write requests through, inflicting
too much read latency... for the disk where my binaries live.  The
longer the seeky burst, the more it hurts both reader and writer, so
cutting down the max number of queueable requests helps the reader
(which I think can't queue anywhere near as much per unit time as the
writer can) finish and get out of the writer's way sooner.

'nuff possibly useless words, onward to possibly useless numbers :)

dd pre == number dd emits upon receiving USR1 before execing perf.
perf stat == time to load/execute perf stat konsole -e exit.
dd post == same after dd number, after perf finishes.

quantum = 1                                                  Avg
dd pre         58.4     52.5     56.1     61.6     52.3     56.1  MB/s
perf stat      2.87     0.91     1.64     1.41     0.90      1.5  Sec
dd post        56.6     61.0     66.3     64.7     60.9     61.9

quantum = 2
dd pre         59.7     62.4     58.9     65.3     60.3     61.3
perf stat      5.81     6.09     6.24    10.13     6.21      6.8
dd post        64.0     62.6     64.2     60.4     61.1     62.4

quantum = 3
dd pre         65.5     57.7     54.5     51.1     56.3     57.0
perf stat     14.01    13.71     8.35     5.35     8.57      9.9
dd post        59.2     49.1     58.8     62.3     62.1     58.3

quantum = 4
dd pre         57.2     52.1     56.8     55.2     61.6     56.5
perf stat     11.98     1.61     9.63    16.21    11.13     10.1
dd post        57.2     52.6     62.2     49.3     50.2     54.3

Nothing pinned btw, 4 cores available, but only 1 drive.
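
Averaging the "perf stat" rows above shows the trend Mike describes:
konsole load time climbs as quantum grows. A short editor's sketch over
the raw numbers from the table:

```python
# Average each "perf stat" row from the table above (konsole load time
# in seconds, five runs per CFQ quantum setting) and check the trend.

runs = {
    1: [2.87, 0.91, 1.64, 1.41, 0.90],
    2: [5.81, 6.09, 6.24, 10.13, 6.21],
    3: [14.01, 13.71, 8.35, 5.35, 8.57],
    4: [11.98, 1.61, 9.63, 16.21, 11.13],
}

avgs = {q: sum(v) / len(v) for q, v in runs.items()}
for q in sorted(avgs):
    print(f"quantum={q}: {avgs[q]:4.1f}s")

# Load latency climbs monotonically with quantum, consistent with more
# write requests being dispatched per round ahead of the reader.
assert all(avgs[q] < avgs[q + 1] for q in (1, 2, 3))
```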

#!/bin/sh

DISK=sdb
QUANTUM=/sys/block/$DISK/queue/iosched/quantum
END=$(cat $QUANTUM)

for q in `seq 1 $END`; do
	echo $q > $QUANTUM
	LOGFILE=quantum_log_$q
	rm -f $LOGFILE
	for i in `seq 1 5`; do
		echo 2 > /proc/sys/vm/drop_caches
		sh -c "dd if=/dev/zero of=./deleteme.dd 2>&1|tee -a $LOGFILE" &
		sleep 30
		sh -c "echo quantum $(cat $QUANTUM) loop $i" 2>&1|tee -a $LOGFILE
		perf stat -- killall -q get_stuf_into_ram >/dev/null 2>&1
		sleep 1
		killall -q -USR1 dd &
		sleep 1
		sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE
		sleep 1
		killall -q -USR1 dd &
		sleep 5
		killall -qw dd
		rm -f ./deleteme.dd
		sync
		sh -c "echo" 2>&1|tee -a $LOGFILE
	done;
done;

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-25 20:26     ` Vivek Goyal
  (?)
@ 2009-09-26 14:51     ` Mike Galbraith
       [not found]       ` <1253976676.7005.40.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-09-27  6:55       ` Mike Galbraith
  -1 siblings, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-26 14:51 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, mingo, riel, jens.axboe

On Fri, 2009-09-25 at 16:26 -0400, Vivek Goyal wrote:
> On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> > Vivek Goyal wrote:
> > > Notes:
> > > - With vanilla CFQ, random writers can overwhelm a random reader.
> > >   Bring down its throughput and bump up latencies significantly.
> > 
> > 
> > IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> > too.
> > 
> > I'm basing this assumption on the observations I made on both OpenSuse
> > 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> > titled: "Poor desktop responsiveness with background I/O-operations" of
> > 2009-09-20.
> > (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
> > 
> > 
> > Thus, I'm posting this to show that your work is greatly appreciated,
> > given the rather disappointig status quo of Linux's fairness when it
> > comes to disk IO time.
> > 
> > I hope that your efforts lead to a change in performance of current
> > userland applications, the sooner, the better.
> > 
> [Please don't remove people from original CC list. I am putting them back.]
> 
> Hi Ulrich,
> 
> I quicky went through that mail thread and I tried following on my
> desktop.
> 
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> sleep 5
> time firefox
> # close firefox once gui pops up.
> ##########################################
> 
> It was taking close to 1 minute 30 seconds to launch firefox and dd got 
> following.
> 
> 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
> 
> (Results do vary across runs, especially if system is booted fresh. Don't
>  know why...).
> 
> 
> Then I tried putting both the applications in separate groups and assign
> them weights 200 each.
> 
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> echo $! > /cgroup/io/test1/tasks
> sleep 5
> echo $$ > /cgroup/io/test2/tasks
> time firefox
> # close firefox once gui pops up.
> ##########################################
> 
> Now I firefox pops up in 27 seconds. So it cut down the time by 2/3.
> 
> 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
> 
> Notice that throughput of dd also improved.
> 
> I ran the block trace and noticed in many a cases firefox threads
> immediately preempted the "dd". Probably because it was a file system
> request. So in this case latency will arise from seek time.
> 
> In some other cases, threads had to wait for up to 100ms because dd was
> not preempted. In this case latency will arise both from waiting on queue
> as well as seek time.

Hm, with tip, I see ~10ms max wakeup latency running scriptlet below.

> With the cgroup setup, we run a 100ms slice for the group in which firefox
> is being launched and then give a 100ms uninterrupted time slice to dd. So
> it should cut down on the number of seeks happening, and that's probably
> why we see this improvement.

I'm not testing with group IO/CPU, but my numbers kinda agree that it's
seek latency that's THE killer.  What the compiled numbers from the
cheezy script below _seem_ to be telling me is that the default CFQ
quantum setting is allowing too many write requests through, inflicting
too much read latency... for the disk where my binaries live.  The
longer the seeky burst, the more it hurts both reader and writer, so
cutting down the max queueable requests helps the reader (which I think
can't queue anywhere near as many requests per unit time as the writer
can) finish and get out of the writer's way sooner.

'nuff possibly useless words, onward to possibly useless numbers :)

dd pre == number dd emits upon receiving USR1, before execing perf.
perf stat == time to load/execute 'perf stat konsole -e exit'.
dd post == same dd number, taken after perf finishes.

quantum = 1                                                  Avg
dd pre         58.4     52.5     56.1     61.6     52.3     56.1  MB/s
perf stat      2.87     0.91     1.64     1.41     0.90      1.5  Sec
dd post        56.6     61.0     66.3     64.7     60.9     61.9

quantum = 2
dd pre         59.7     62.4     58.9     65.3     60.3     61.3
perf stat      5.81     6.09     6.24    10.13     6.21      6.8
dd post        64.0     62.6     64.2     60.4     61.1     62.4

quantum = 3
dd pre         65.5     57.7     54.5     51.1     56.3     57.0
perf stat     14.01    13.71     8.35     5.35     8.57      9.9
dd post        59.2     49.1     58.8     62.3     62.1     58.3

quantum = 4
dd pre         57.2     52.1     56.8     55.2     61.6     56.5
perf stat     11.98     1.61     9.63    16.21    11.13     10.1
dd post        57.2     52.6     62.2     49.3     50.2     54.3

Nothing pinned btw, 4 cores available, but only 1 drive.
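
FWIW, averages like the ones compiled above can be pulled out of the
quantum_log files with a small helper like this (a sketch, assuming dd's
summary lines ending in "MB/s" survive in the logs via tee):

```shell
# Average the MB/s figure from dd's summary lines in a log file.
# dd summary lines look like: "... copied, 100.602 s, 42.7 MB/s",
# so the rate is the next-to-last field on any line containing MB/s.
avg_mbs() {
    awk '/MB\/s/ { sum += $(NF-1); n++ }
         END { if (n) printf "%.1f\n", sum / n }' "$1"
}
```

Usage would be something like `for f in quantum_log_*; do echo "$f: $(avg_mbs $f) MB/s"; done`.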

#!/bin/sh

DISK=sdb
QUANTUM=/sys/block/$DISK/queue/iosched/quantum
END=$(cat $QUANTUM)

for q in `seq 1 $END`; do
	echo $q > $QUANTUM
	LOGFILE=quantum_log_$q
	rm -f $LOGFILE
	for i in `seq 1 5`; do
		echo 2 > /proc/sys/vm/drop_caches
		sh -c "dd if=/dev/zero of=./deleteme.dd 2>&1|tee -a $LOGFILE" &
		sleep 30
		sh -c "echo quantum $(cat $QUANTUM) loop $i" 2>&1|tee -a $LOGFILE
		perf stat -- killall -q get_stuf_into_ram >/dev/null 2>&1
		sleep 1
		killall -q -USR1 dd &
		sleep 1
		sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE
		sleep 1
		killall -q -USR1 dd &
		sleep 5
		killall -qw dd
		rm -f ./deleteme.dd
		sync
		sh -c "echo" 2>&1|tee -a $LOGFILE
	done;
done;



^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]   ` <4ABC28DE.7050809-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org>
@ 2009-09-25 20:26     ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25 20:26 UTC (permalink / raw)
  To: Ulrich Lukas
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> Vivek Goyal wrote:
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader.
> >   Bring down its throughput and bump up latencies significantly.
> 
> 
> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> too.
> 
> I'm basing this assumption on the observations I made on both OpenSuse
> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> titled: "Poor desktop responsiveness with background I/O-operations" of
> 2009-09-20.
> (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
> 
> 
> Thus, I'm posting this to show that your work is greatly appreciated,
> given the rather disappointing status quo of Linux's fairness when it
> comes to disk IO time.
> 
> I hope that your efforts lead to a change in performance of current
> userland applications, the sooner, the better.
> 
[Please don't remove people from original CC list. I am putting them back.]

Hi Ulrich,

I quickly went through that mail thread and tried the following on my
desktop.

##########################################
dd if=/home/vgoyal/4G-file of=/dev/null &
sleep 5
time firefox
# close firefox once gui pops up.
##########################################

It was taking close to 1 minute 30 seconds to launch firefox, and dd
reported the following.

4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s

(Results do vary across runs, especially if system is booted fresh. Don't
 know why...).


Then I tried putting both applications in separate groups, assigning
each a weight of 200.

##########################################
dd if=/home/vgoyal/4G-file of=/dev/null &
echo $! > /cgroup/io/test1/tasks
sleep 5
echo $$ > /cgroup/io/test2/tasks
time firefox
# close firefox once gui pops up.
##########################################

Now firefox pops up in 27 seconds, so it cut the time down by about 2/3.

4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s

Notice that the throughput of dd also improved.

I ran blktrace and noticed that in many cases the firefox threads
immediately preempted the "dd", probably because they were file system
requests. So in these cases latency arises from seek time.

In some other cases, threads had to wait for up to 100ms because dd was
not preempted. In these cases latency arises both from waiting in the
queue and from seek time.

With the cgroup setup, we run a 100ms slice for the group in which firefox
is being launched and then give a 100ms uninterrupted time slice to dd. So
it should cut down on the number of seeks happening, and that's probably
why we see this improvement.

So grouping can help in such cases. Maybe you can move your X session into
one group and launch the big IO in another group. Most likely you will have
a better desktop experience without compromising on dd throughput.
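
Following the test commands above, that setup might look roughly like this
(a sketch only: the /cgroup/io mount point matches the earlier commands,
but the "io.weight" knob name is an assumption and may differ per
controller version):

```shell
# Hypothetical sketch: favour the desktop group over a batch IO group.
# "io.weight" is an assumed knob name; check your controller's files.
mkdir -p /cgroup/io/desktop /cgroup/io/batch
echo 500 > /cgroup/io/desktop/io.weight
echo 200 > /cgroup/io/batch/io.weight

echo $$ > /cgroup/io/desktop/tasks      # current shell / X session
dd if=/home/vgoyal/4G-file of=/dev/null &
echo $! > /cgroup/io/batch/tasks        # background IO job
```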

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]       ` <20090925.180724.104041942.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  2009-09-25 14:33         ` Vivek Goyal
@ 2009-09-25 15:04         ` Rik van Riel
  1 sibling, 0 replies; 349+ messages in thread
From: Rik van Riel @ 2009-09-25 15:04 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: vgoyal, akpm, linux-kernel, jens.axboe, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo

Ryo Tsuruta wrote:

> Because dm-ioband provides fairness in terms of how many IO requests
> are issued or how many bytes are transferred, this behaviour is to
> be expected. Do you think fairness in terms of IO requests and size is
> not fair?

When there are two workloads competing for the same
resources, I would expect each of the workloads to
run at about 50% of the speed at which it would run
on an uncontended system.

Having one of the workloads run at 95% of the
uncontended speed and the other workload at 5%
is "not fair" (to put it diplomatically).

-- 
All rights reversed.

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]       ` <20090925.180724.104041942.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
@ 2009-09-25 14:33         ` Vivek Goyal
  2009-09-25 15:04         ` Rik van Riel
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25 14:33 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: akpm, linux-kernel, jens.axboe, containers, dm-devel, nauman,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo,
	riel

On Fri, Sep 25, 2009 at 06:07:24PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > Higher level solutions are not keeping track of time slices. Time slices will
> > be allocated by CFQ which does not have any idea about grouping. Higher
> > level controller just keeps track of size of IO done at group level and
> > then runs either a leaky bucket or token bucket algorithm.
> > 
> > IO throttling is a max BW controller, so it will not even care about what is
> > happening in other group. It will just be concerned with rate of IO in one
> > particular group and if we exceed specified limit, throttle it. So until and
> > unless the sequential reader group hits its max bw limit, it will keep sending
> > reads down to CFQ, and CFQ will happily assign 100ms slices to readers.
> > 
> > dm-ioband will not try to choke the high throughput sequential reader group
> > for the slow random reader group because that would just kill the throughput
> > of rotational media. Every sequential reader will run for a few ms and then
> > be throttled and this goes on. Disk will soon be seek bound.
> 
> Because dm-ioband provides fairness in terms of how many IO requests
> are issued or how many bytes are transferred, this behaviour is to
> be expected. Do you think fairness in terms of IO requests and size is
> not fair?
> 

Hi Ryo,

Fairness in terms of size of IO or number of requests is probably not the
best thing to do on rotational media where seek latencies are significant.

It should probably work just as well on media with very low seek latencies,
like SSDs.

So on rotational media, either you will not provide fairness to random
readers because they are too slow, or you will choke the sequential readers
in the other group and also bring down the overall disk throughput.

If you don't decide to choke/throttle the sequential reader group for the
sake of the random reader in the other group, then you will not have good
control over random reader latencies, because the IO scheduler sees the IO
from both the sequential readers and the random reader, and the sequential
readers have not been throttled. So the dispatch pattern/time slices will
again look like:

	SR1 SR2 SR3 SR4 SR5 RR.....

	instead  of

	SR1 RR SR2 RR SR3 RR SR4 RR ....
 
SR --> sequential reader,  RR --> random reader
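
The token-bucket throttling described above can be sketched as a toy model
in a few lines of shell (this is an illustration of the general scheme, not
dm-ioband's actual code: tokens refill at a fixed rate per tick, and a
request is dispatched only when enough tokens are banked):

```shell
# Toy token bucket: refill "rate" tokens per tick up to "max"; a
# request is dispatched if the bucket covers its size, else throttled.
rate=100    # tokens added per tick
max=500     # bucket capacity
bucket=0
dispatched=0
throttled=0
for req in 60 300 120 500 80; do      # request sizes, one per tick
    bucket=$((bucket + rate))
    if [ "$bucket" -gt "$max" ]; then bucket=$max; fi
    if [ "$bucket" -ge "$req" ]; then
        bucket=$((bucket - req))      # spend tokens, dispatch
        dispatched=$((dispatched + 1))
    else
        throttled=$((throttled + 1))  # request waits for more tokens
    fi
done
```

Note how a large request (500 here) keeps getting throttled until several
ticks' worth of tokens accumulate, which is exactly why a slow group can
stall behind a fast one regardless of what the IO scheduler underneath
would prefer to dispatch.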

> > > >   Buffering at higher layer can delay read requests for more than slice idle
> > > >   period of CFQ (default 8 ms). That means, it is possible that we are waiting
> > > >   for a request from the queue but it is buffered at higher layer and then idle
> > > >   timer will fire. It means that queue will lose its share at the same time
> > > >   overall throughput will be impacted as we lost those 8 ms.
> > > 
> > > That sounds like a bug.
> > > 
> > 
> > Actually this probably is a limitation of higher level controller. It most
> > likely is sitting so high in IO stack that it has no idea what underlying
> > IO scheduler is and what are IO scheduler's policies. So it can't keep up
> > with IO scheduler's policies. Secondly, it might be a low weight group and
> > tokens might not be available fast enough to release the request.
> >
> > > >   Read Vs Write
> > > >   -------------
> > > >   Writes can overwhelm readers hence second level controller FIFO release
> > > >   will run into issue here. If there is a single queue maintained then reads
> > > >   will suffer large latencies. If there are separate queues for reads and writes
> > > >   then it will be hard to decide in what ratio to dispatch reads and writes as
> > > >   it is IO scheduler's decision to decide when and how much read/write to
> > > >   dispatch. This is another place where higher level controller will not be in
> > > >   sync with lower level io scheduler and can change the effective policies of
> > > >   underlying io scheduler.
> > > 
> > > The IO schedulers already take care of read-vs-write and already take
> > > care of preventing large writes-starve-reads latencies (or at least,
> > > they're supposed to).
> > 
> > True. Actually this is a limitation of higher level controller. A higher
> > level controller will most likely implement some kind of queuing/buffering
> > mechanism where it will buffer requests when it decides to throttle the
> > group. Now once a fair number of read and write requests are buffered, and
> > if the controller is ready to dispatch some requests from the group, which
> > requests/bio should it dispatch? Reads first, writes first, or reads and
> > writes in a certain ratio?
> 
> The write-starve-reads on dm-ioband, that you pointed out before, was
> not caused by FIFO release, it was caused by IO flow control in
> dm-ioband. When I turned off the flow control, then the read
> throughput was quite improved.

What was flow control doing?

> 
> Now I'm considering separating dm-ioband's internal queue into sync
> and async and giving a certain priority of dispatch to async IOs.

Even if you maintain separate queues for sync and async, in what ratio will
you dispatch reads and writes to the underlying layer once fresh tokens
become available to the group and you decide to unthrottle it?

Whatever policy you adopt for read and write dispatch, it might not match
the policy of the underlying IO scheduler, because every IO scheduler seems
to have its own way of determining how reads and writes should be dispatched.

Now somebody might start complaining that my job inside the group is not
getting same reader/writer ratio as it was getting outside the group.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-25  9:07       ` Ryo Tsuruta
@ 2009-09-25 14:33         ` Vivek Goyal
  -1 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25 14:33 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: akpm, linux-kernel, jens.axboe, containers, dm-devel, nauman,
	dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
	s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo,
	riel

On Fri, Sep 25, 2009 at 06:07:24PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > Higher level solutions are not keeping track of time slices. Time slices will
> > be allocated by CFQ which does not have any idea about grouping. Higher
> > level controller just keeps track of size of IO done at group level and
> > then run either a leaky bucket or token bucket algorithm.
> > 
> > IO throttling is a max BW controller, so it will not even care about what is
> > happening in other group. It will just be concerned with rate of IO in one
> > particular group and if we exceed specified limit, throttle it. So until and
> > unless sequential reader group hits it max bw limit, it will keep sending
> > reads down to CFQ, and CFQ will happily assign 100ms slices to readers.
> > 
> > dm-ioband will not try to choke the high throughput sequential reader group
> > for the slow random reader group because that would just kill the throughput
> > of rotational media. Every sequential reader will run for few ms and then 
> > be throttled and this goes on. Disk will soon be seek bound.
> 
> Because dm-ioband provides faireness in terms of how many IO requests
> are issued or how many bytes are transferred, so this behaviour is to
> be expected. Do you think fairness in terms of IO requests and size is
> not fair?
> 

Hi Ryo,

Fairness in terms of size of IO or number of requests is probably not the
best thing to do on rotational media where seek latencies are significant.

It probably should work just well on media with very low seek latencies
like SSD.

So on rotational media, either you will not provide fairness to random 
readers because they are too slow or you will choke the sequential readers
in other group and also bring down the overall disk throughput.

If you don't decide to choke/throttle sequential reader group for the sake
of random reader in other group then you will not have a good control
on random reader latencies. Because now IO scheduler sees the IO from both
sequential reader as well as random reader and sequential readers have not
been throttled. So the dispatch pattern/time slices will again look like..

	SR1 SR2 SR3 SR4 SR5 RR.....

	instead  of

	SR1 RR SR2 RR SR3 RR SR4 RR ....
 
SR --> sequential reader,  RR --> random reader

> > > >   Buffering at higher layer can delay read requests for more than slice idle
> > > >   period of CFQ (default 8 ms). That means, it is possible that we are waiting
> > > >   for a request from the queue but it is buffered at higher layer and then idle
> > > >   timer will fire. It means that queue will losse its share at the same time
> > > >   overall throughput will be impacted as we lost those 8 ms.
> > > 
> > > That sounds like a bug.
> > > 
> > 
> > Actually this probably is a limitation of higher level controller. It most
> > likely is sitting so high in IO stack that it has no idea what underlying
> > IO scheduler is and what are IO scheduler's policies. So it can't keep up
> > with IO scheduler's policies. Secondly, it might be a low weight group and
> > tokens might not be available fast enough to release the request.
> >
> > > >   Read Vs Write
> > > >   -------------
> > > >   Writes can overwhelm readers hence second level controller FIFO release
> > > >   will run into issue here. If there is a single queue maintained then reads
> > > >   will suffer large latencies. If there separate queues for reads and writes
> > > >   then it will be hard to decide in what ratio to dispatch reads and writes as
> > > >   it is IO scheduler's decision to decide when and how much read/write to
> > > >   dispatch. This is another place where higher level controller will not be in
> > > >   sync with lower level io scheduler and can change the effective policies of
> > > >   underlying io scheduler.
> > > 
> > > The IO schedulers already take care of read-vs-write and already take
> > > care of preventing large writes-starve-reads latencies (or at least,
> > > they're supposed to).
> > 
> > True. Actually this is a limitation of higher level controller. A higher
> > level controller will most likely implement some of kind of queuing/buffering
> > mechanism where it will buffer requeuests when it decides to throttle the
> > group. Now once a fair number read and requests are buffered, and if
> > controller is ready to dispatch some requests from the group, which
> > requests/bio should it dispatch? reads first or writes first or reads and
> > writes in certain ratio?
> 
> The write-starve-reads on dm-ioband, that you pointed out before, was
> not caused by FIFO release, it was caused by IO flow control in
> dm-ioband. When I turned off the flow control, then the read
> throughput was quite improved.

What was flow control doing?

> 
> Now I'm considering separating dm-ioband's internal queue into sync
> and async and giving a certain priority of dispatch to async IOs.

Even if you maintain separate queues for sync and async, in what ratio will
you dispatch reads and writes to underlying layer once fresh tokens become
available to the group and you decide to unthrottle the group.

Whatever policy you adopt for read and write dispatch, it might not match
with policy of underlying IO scheduler because every IO scheduler seems to
have its own way of determining how reads and writes should be dispatched.

Now somebody might start complaining that my job inside the group is not
getting same reader/writer ratio as it was getting outside the group.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-09-25 14:33         ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25 14:33 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, riel, lizf, fchecconi, akpm, containers,
	linux-kernel, s-uchida, righi.andrea, torvalds

On Fri, Sep 25, 2009 at 06:07:24PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
> 
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > Higher level solutions are not keeping track of time slices. Time slices will
> > be allocated by CFQ which does not have any idea about grouping. Higher
> > level controller just keeps track of size of IO done at group level and
> > then run either a leaky bucket or token bucket algorithm.
> > 
> > IO throttling is a max BW controller, so it will not even care about what is
> > happening in other group. It will just be concerned with rate of IO in one
> > particular group and if we exceed specified limit, throttle it. So until and
> > unless sequential reader group hits it max bw limit, it will keep sending
> > reads down to CFQ, and CFQ will happily assign 100ms slices to readers.
> > 
> > dm-ioband will not try to choke the high throughput sequential reader group
> > for the slow random reader group because that would just kill the throughput
> > of rotational media. Every sequential reader will run for few ms and then 
> > be throttled and this goes on. Disk will soon be seek bound.
> 
> Because dm-ioband provides faireness in terms of how many IO requests
> are issued or how many bytes are transferred, so this behaviour is to
> be expected. Do you think fairness in terms of IO requests and size is
> not fair?
> 

Hi Ryo,

Fairness in terms of size of IO or number of requests is probably not the
best thing to do on rotational media, where seek latencies are significant.

It should work just as well on media with very low seek latencies,
like SSDs.

So on rotational media, either you will not provide fairness to random
readers because they are too slow, or you will choke the sequential readers
in the other group and also bring down the overall disk throughput.

If you decide not to choke/throttle the sequential reader group for the sake
of the random reader in the other group, then you will not have good control
over random reader latencies, because the IO scheduler now sees the IO from both
the sequential readers and the random reader, and the sequential readers have
not been throttled. So the dispatch pattern/time slices will again look like..

	SR1 SR2 SR3 SR4 SR5 RR.....

	instead  of

	SR1 RR SR2 RR SR3 RR SR4 RR ....
 
SR --> sequential reader,  RR --> random reader

> > > >   Buffering at a higher layer can delay read requests for more than the slice idle
> > > >   period of CFQ (default 8 ms). That means it is possible that we are waiting
> > > >   for a request from the queue but it is buffered at a higher layer, and then the idle
> > > >   timer will fire. It means that the queue will lose its share; at the same time
> > > >   overall throughput will be impacted, as we lost those 8 ms.
> > > 
> > > That sounds like a bug.
> > > 
> > 
> > Actually this probably is a limitation of a higher level controller. It most
> > likely sits so high in the IO stack that it has no idea what the underlying
> > IO scheduler is or what the IO scheduler's policies are, so it can't keep up
> > with the IO scheduler's policies. Secondly, it might be a low weight group and
> > tokens might not become available fast enough to release the request.
> >
> > > >   Read Vs Write
> > > >   -------------
> > > >   Writes can overwhelm readers, hence a second level controller doing FIFO release
> > > >   will run into issues here. If a single queue is maintained, then reads
> > > >   will suffer large latencies. If there are separate queues for reads and writes,
> > > >   then it will be hard to decide in what ratio to dispatch reads and writes, as
> > > >   it is the IO scheduler's decision when and how much read/write to
> > > >   dispatch. This is another place where a higher level controller will not be in
> > > >   sync with the lower level IO scheduler and can change the effective policies of
> > > >   the underlying IO scheduler.
> > > 
> > > The IO schedulers already take care of read-vs-write and already take
> > > care of preventing large writes-starve-reads latencies (or at least,
> > > they're supposed to).
> > 
> > True. Actually this is a limitation of a higher level controller. A higher
> > level controller will most likely implement some kind of queuing/buffering
> > mechanism where it will buffer requests when it decides to throttle the
> > group. Now once a fair number of read and write requests are buffered, and
> > the controller is ready to dispatch some requests from the group, which
> > requests/bios should it dispatch? Reads first, or writes first, or reads and
> > writes in a certain ratio?
> 
> The write-starves-reads behaviour on dm-ioband that you pointed out before
> was not caused by FIFO release; it was caused by IO flow control in
> dm-ioband. When I turned off the flow control, the read
> throughput improved considerably.

What was flow control doing?

> 
> Now I'm considering separating dm-ioband's internal queue into sync
> and async and giving a certain priority of dispatch to async IOs.

Even if you maintain separate queues for sync and async, in what ratio will
you dispatch reads and writes to the underlying layer once fresh tokens become
available to the group and you decide to unthrottle it?

Whatever policy you adopt for read and write dispatch, it might not match
the policy of the underlying IO scheduler, because every IO scheduler has
its own way of determining how reads and writes should be dispatched.

Now somebody might start complaining that their job inside the group is not
getting the same read/write ratio as it was getting outside the group.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]     ` <20090925050429.GB12555-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-25  9:07       ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-09-25  9:07 UTC (permalink / raw)
  To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,

Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Higher level solutions are not keeping track of time slices. Time slices will
> be allocated by CFQ which does not have any idea about grouping. Higher
> level controller just keeps track of size of IO done at group level and
> then run either a leaky bucket or token bucket algorithm.
> 
> IO throttling is a max BW controller, so it will not even care about what is
> happening in other group. It will just be concerned with rate of IO in one
> particular group and if we exceed specified limit, throttle it. So until and
> unless the sequential reader group hits its max BW limit, it will keep sending
> reads down to CFQ, and CFQ will happily assign 100ms slices to readers.
> 
> dm-ioband will not try to choke the high throughput sequential reader group
> for the slow random reader group because that would just kill the throughput
> of rotational media. Every sequential reader will run for few ms and then 
> be throttled and this goes on. Disk will soon be seek bound.

Because dm-ioband provides fairness in terms of how many IO requests
are issued or how many bytes are transferred, this behaviour is to
be expected. Do you think fairness in terms of IO requests and size is
not fair?

> > >   Buffering at higher layer can delay read requests for more than slice idle
> > >   period of CFQ (default 8 ms). That means, it is possible that we are waiting
> > >   for a request from the queue but it is buffered at higher layer and then idle
> > >   timer will fire. It means that the queue will lose its share; at the same time
> > >   overall throughput will be impacted, as we lost those 8 ms.
> > 
> > That sounds like a bug.
> > 
> 
> Actually this probably is a limitation of higher level controller. It most
> likely is sitting so high in IO stack that it has no idea what underlying
> IO scheduler is and what are IO scheduler's policies. So it can't keep up
> with IO scheduler's policies. Secondly, it might be a low weight group and
> tokens might not be available fast enough to release the request.
>
> > >   Read Vs Write
> > >   -------------
> > >   Writes can overwhelm readers hence second level controller FIFO release
> > >   will run into issue here. If there is a single queue maintained then reads
> > >   will suffer large latencies. If there are separate queues for reads and writes,
> > >   then it will be hard to decide in what ratio to dispatch reads and writes, as
> > >   it is IO scheduler's decision to decide when and how much read/write to
> > >   dispatch. This is another place where higher level controller will not be in
> > >   sync with lower level io scheduler and can change the effective policies of
> > >   underlying io scheduler.
> > 
> > The IO schedulers already take care of read-vs-write and already take
> > care of preventing large writes-starve-reads latencies (or at least,
> > they're supposed to).
> 
> True. Actually this is a limitation of higher level controller. A higher
> level controller will most likely implement some kind of queuing/buffering
> mechanism where it will buffer requests when it decides to throttle the
> group. Now once a fair number of read and write requests are buffered, and if
> controller is ready to dispatch some requests from the group, which
> requests/bio should it dispatch? reads first or writes first or reads and
> writes in certain ratio?

The write-starves-reads behaviour on dm-ioband that you pointed out before
was not caused by FIFO release; it was caused by IO flow control in
dm-ioband. When I turned off the flow control, the read
throughput improved considerably.

Now I'm considering separating dm-ioband's internal queue into sync
and async and giving a certain priority of dispatch to async IOs.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]         ` <20090925052911.GK4590-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
@ 2009-09-25  7:09           ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-09-25  7:09 UTC (permalink / raw)
  To: balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi,

Balbir Singh <balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> > > I think I must support dirty-ratio in memcg layer. But not yet.
> > 
> 
> We need to add this to the TODO list.
> 
> > OR...I'll add a buffered-write-cgroup to track buffered writebacks,
> > and add a control knob,
> >   buffered_write.nr_dirty_thresh
> > to limit the number of dirty pages generated via a cgroup.
> >
> > Because memcg just records an owner of pages but does not record who makes them
> > dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
> > cgroup code.
> 
> Very good point, this is crucial for shared pages.
> 
> > 
> > But I'm not sure how I should treat I/Os generated by kswapd.
> >
> 
> Account them to process 0 :)

How about accounting them to the processes that made the pages dirty? I think
that a process which consumes more memory should get a penalty. However,
this allows a page-requesting process to use another's bandwidth; but if
a user doesn't want the memory to be swapped out, the user should allocate
enough memory for the process by using memcg in advance.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]       ` <20090925101821.1de8091a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2009-09-25  5:29         ` Balbir Singh
  0 siblings, 0 replies; 349+ messages in thread
From: Balbir Singh @ 2009-09-25  5:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

* KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> [2009-09-25 10:18:21]:

> On Fri, 25 Sep 2009 10:09:52 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
> 
> > On Thu, 24 Sep 2009 14:33:15 -0700
> > Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > > ===================================================================
> > > > Fairness for async writes is tricky, and the biggest reason is that async writes
> > > > are cached in higher layers (page cache) as well as possibly in the file system
> > > > layer (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > > in a proportional manner.
> > > > 
> > > > For example, consider two dd threads reading /dev/zero as the input file and doing
> > > > writes of huge files. Very soon we will cross vm_dirty_ratio and a dd thread will
> > > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > > not necessarily dirty pages of the same thread are picked. It can very well pick
> > > > the inode of the lesser priority dd thread and do some writeout. So effectively
> > > > the higher weight dd is doing writeouts of the lower weight dd's pages and we
> > > > don't see service differentiation.
> > > > 
> > > > IOW, the core problem with buffered write fairness is that the higher weight thread
> > > > does not throw enough IO traffic at the IO controller to keep the queue
> > > > continuously backlogged. In my testing, there are many 0.2 to 0.8 second
> > > > intervals where the higher weight queue is empty, and in that duration the lower
> > > > weight queue gets lots of work done, giving the impression that there was no
> > > > service differentiation.
> > > > 
> > > > In summary, from the IO controller point of view, async write support is there.
> > > > Because the page cache has not been designed in such a manner that a higher
> > > > prio/weight writer can do more writeout than a lower prio/weight
> > > > writer, getting service differentiation is hard, and it is visible in some
> > > > cases and not visible in others.
> > > 
> > > Here's where it all falls to pieces.
> > > 
> > > For async writeback we just don't care about IO priorities.  Because
> > > from the point of view of the userspace task, the write was async!  It
> > > occurred at memory bandwidth speed.
> > > 
> > > It's only when the kernel's dirty memory thresholds start to get
> > > exceeded that we start to care about prioritisation.  And at that time,
> > > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > > consumes just as much memory as a low-ioprio dirty page.
> > > 
> > > So when balance_dirty_pages() hits, what do we want to do?
> > > 
> > > I suppose that all we can do is to block low-ioprio processes more
> > > aggressively at the VFS layer, to reduce the rate at which they're
> > > dirtying memory so as to give high-ioprio processes more of the disk
> > > bandwidth.
> > > 
> > > But you've gone and implemented all of this stuff at the io-controller
> > > level and not at the VFS level so you're, umm, screwed.
> > > 
> > 
> > I think I must support dirty-ratio in memcg layer. But not yet.
> 

We need to add this to the TODO list.

> OR...I'll add a buffered-write-cgroup to track buffered writebacks,
> and add a control knob,
>   buffered_write.nr_dirty_thresh
> to limit the number of dirty pages generated via a cgroup.
>
> Because memcg just records an owner of pages but does not record who makes them
> dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
> cgroup code.

Very good point, this is crucial for shared pages.

> 
> But I'm not sure how I should treat I/Os generated out by kswapd.
>

Account them to process 0 :)

-- 
	Balbir

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-25  1:18       ` KAMEZAWA Hiroyuki
  (?)
@ 2009-09-25  5:29       ` Balbir Singh
  2009-09-25  7:09           ` Ryo Tsuruta
       [not found]         ` <20090925052911.GK4590-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
  -1 siblings, 2 replies; 349+ messages in thread
From: Balbir Singh @ 2009-09-25  5:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Vivek Goyal, linux-kernel, jens.axboe, containers,
	dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	ryov, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval,
	righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo,
	riel

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-25 10:18:21]:

> On Fri, 25 Sep 2009 10:09:52 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> 
> > On Thu, 24 Sep 2009 14:33:15 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > > ===================================================================
> > > > Fairness for async writes is tricky and biggest reason is that async writes
> > > > are cached in higher layers (page cahe) as well as possibly in file system
> > > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > > in proportional manner.
> > > > 
> > > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > > service differentation.
> > > > 
> > > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > > does not throw enought IO traffic at IO controller to keep the queue
> > > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > > intervals where higher weight queue is empty and in that duration lower weight
> > > > queue get lots of job done giving the impression that there was no service
> > > > differentiation.
> > > > 
> > > > In summary, from IO controller point of view async writes support is there.
> > > > Because page cache has not been designed in such a manner that higher 
> > > > prio/weight writer can do more write out as compared to lower prio/weight
> > > > writer, gettting service differentiation is hard and it is visible in some
> > > > cases and not visible in some cases.
> > > 
> > > Here's where it all falls to pieces.
> > > 
> > > For async writeback we just don't care about IO priorities.  Because
> > > from the point of view of the userspace task, the write was async!  It
> > > occurred at memory bandwidth speed.
> > > 
> > > It's only when the kernel's dirty memory thresholds start to get
> > > exceeded that we start to care about prioritisation.  And at that time,
> > > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > > consumes just as much memory as a low-ioprio dirty page.
> > > 
> > > So when balance_dirty_pages() hits, what do we want to do?
> > > 
> > > I suppose that all we can do is to block low-ioprio processes more
> > > agressively at the VFS layer, to reduce the rate at which they're
> > > dirtying memory so as to give high-ioprio processes more of the disk
> > > bandwidth.
> > > 
> > > But you've gone and implemented all of this stuff at the io-controller
> > > level and not at the VFS level so you're, umm, screwed.
> > > 
> > 
> > I think I must support dirty-ratio in memcg layer. But not yet.
> 

We need to add this to the TODO list.

> OR...I'll add a bufferred-write-cgroup to track bufferred writebacks.
> And add a control knob as
>   bufferred_write.nr_dirty_thresh
> to limit the number of dirty pages generetad via a cgroup.
> 
> Because memcg just records a owner of pages but not records who makes them
> dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
> cgroup code.

Very good point, this is crucial for shared pages.

> 
> But I'm not sure how I should treat I/Os generated by kswapd.
>

Account them to process 0 :)

-- 
	Balbir

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]   ` <20090924143315.781cd0ac.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
  2009-09-25  1:09     ` KAMEZAWA Hiroyuki
@ 2009-09-25  5:04     ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25  5:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:
> On Thu, 24 Sep 2009 15:25:04 -0400
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> 
> > 
> > Hi All,
> > 
> > Here is the V10 of the IO controller patches generated on top of 2.6.31.
> > 
> 
> Thanks for the writeup.  It really helps and is most worthwhile for a
> project of this importance, size and complexity.
> 
> 
> >  
> > What problem are we trying to solve
> > ===================================
> > Provide group IO scheduling feature in Linux along the lines of other resource
> > controllers like cpu.
> > 
> > IOW, provide a facility so that a user can group applications using cgroups
> > control the amount of disk time/bandwidth received by a group based on its
> > weight. 
> > 
> > How to solve the problem
> > =========================
> > 
> > Different people have solved the issue differently. So far it looks
> > like there are the following two core requirements when it comes to
> > fairness at the group level.
> > 
> > - Control bandwidth seen by groups.
> > - Control on latencies when a request gets backlogged in group.
> > 
> > At least there are now three patchsets available (including this one).
> > 
> > IO throttling
> > -------------
> > This is a bandwidth controller which keeps track of IO rate of a group and
> > throttles the process in the group if it exceeds the user specified limit.
> > 
> > dm-ioband
> > ---------
> > This is a proportional bandwidth controller implemented as device mapper
> > driver and provides fair access in terms of amount of IO done (not in terms
> > of disk time as CFQ does).
> > 
> > So one will set up one or more dm-ioband devices on top of a physical/logical
> > block device, configure the ioband device and pass information like grouping
> > etc. Now this device will keep track of bios flowing through it and control
> > the flow of bios based on group policies.
> > 
> > IO scheduler based IO controller
> > --------------------------------
> > Here we have viewed the IO controller problem as a hierarchical group
> > scheduling issue (along the lines of CFS group scheduling). Currently one can
> > view linux IO schedulers as flat where there is one root group and all the IO
> > belongs to that group.
> > 
> > This patchset basically modifies IO schedulers to also support hierarchical
> > group scheduling. CFQ already provides fairness among different processes. I
> > have extended it to support group IO scheduling. I also took some of the code
> > out of CFQ and put it in a common layer so that the same group scheduling code
> > can be used by noop, deadline and AS.
> > 
> > Pros/Cons
> > =========
> > There are pros and cons to each of the approach. Following are some of the
> > thoughts.
> > 
> > Max bandwidth vs proportional bandwidth
> > ---------------------------------------
> > IO throttling is a max bandwidth controller and not a proportional one.
> > Additionally, it provides fairness in terms of amount of IO done (and not in
> > terms of disk time as CFQ does).
> > 
> > Personally, I think that a proportional weight controller is useful to more
> > people than just a max bandwidth controller. In addition, the IO scheduler
> > based controller can also be enhanced to do max bandwidth control, so it can
> > satisfy a wider set of requirements.
> > 
> > Fairness in terms of disk time vs size of IO
> > ---------------------------------------------
> > A higher level controller will most likely be limited to providing fairness
> > in terms of size/number of IO done and will find it hard to provide fairness
> > in terms of disk time used (as CFQ provides between various prio levels). This
> > is because only IO scheduler knows how much disk time a queue has used and
> > information about queues and disk time used is not exported to higher
> > layers.
> > 
> > So a seeky application will still run away with a lot of disk time and bring
> > down the overall throughput of the disk.
> 
> But that's only true if the thing is poorly implemented.
> 
> A high-level controller will need some view of the busyness of the
> underlying device(s).  That could be "proportion of idle time", or
> "average length of queue" or "average request latency" or some mix of
> these or something else altogether.
> 
> But these things are simple to calculate, and are simple to feed back
> to the higher-level controller and probably don't require any changes
> to the IO scheduler at all, which is a great advantage.
> 
> 
> And I must say that high-level throttling based upon feedback from
> lower layers seems like a much better model to me than hacking away in
> the IO scheduler layer.  Both from an implementation point of view and
> from a "we can get it to work on things other than block devices" point
> of view.
> 

Hi Andrew,

Few thoughts.

- A higher level throttling approach suffers from the issue of unfair
  throttling. If there are multiple tasks in the group, whom do we
  throttle, and how do we make sure that we throttle in proportion
  to the prio of the tasks? Andrea's IO throttling implementation suffered
  from these issues. I had run some tests where RT and BW tasks were
  getting the same BW within a group, or tasks of different prio were
  getting the same BW.

  Even if we figure out a way to do fair throttling within a group, the
  underlying IO scheduler might not be CFQ at all, in which case prio-based
  throttling would have been the wrong thing to do.

https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html

- Higher level throttling does not know where the IO is actually going in
  the physical layer. So we might unnecessarily throttle IO which is
  going to the same logical device but, at the end of the day, to different
  physical devices.

  Agreed that some people will want that behavior, especially in the case
  of max bandwidth control where one does not want to give you the BW
  because you did not pay for it.

  So a higher level controller is good for max bw control, but when it comes
  to optimal usage of resources, controlling only when needed, it probably
  is not the best thing.

About the feedback thing, I am not very sure. Are you saying that we will
run time-based groups in a higher layer, take feedback from the underlying IO
scheduler about how much time a group consumed, and not do accounting in
terms of size of IO?

 
> > Currently dm-ioband provides fairness in terms of number/size of IO.
> > 
> > Latencies and isolation between groups
> > --------------------------------------
> > A higher level controller generally implements a bandwidth throttling
> > solution: if a group exceeds either the max bandwidth or its proportional
> > share, then throttle that group.
> > 
> > This kind of approach will probably not help in controlling latencies, as it
> > will depend on the underlying IO scheduler. Consider the following scenario.
> > 
> > Assume there are two groups. One group is running multiple sequential readers
> > and the other group has a random reader. Sequential readers will get a nice
> > 100ms slice
> 
> Do you refer to each reader within group1, or to all readers?  It would be
> daft if each reader in group1 were to get 100ms.
> 

All readers in the group should get 100ms each, in both the IO throttling
and dm-ioband solutions.

Higher level solutions do not keep track of time slices. Time slices will
be allocated by CFQ, which does not have any idea about grouping. The higher
level controller just keeps track of the size of IO done at the group level
and then runs either a leaky bucket or token bucket algorithm.

IO throttling is a max BW controller, so it will not even care about what is
happening in the other group. It will just be concerned with the rate of IO in
one particular group, and if we exceed the specified limit, throttle it. So
unless the sequential reader group hits its max bw limit, it will keep sending
reads down to CFQ, and CFQ will happily assign 100ms slices to the readers.

dm-ioband will not try to choke the high throughput sequential reader group
for the slow random reader group, because that would just kill the throughput
of rotational media: every sequential reader would run for a few ms, then
be throttled, and so on, and the disk would soon be seek bound.

> > each and then a random reader from group2 will get to dispatch the
> > request. So latency of this random reader will depend on how many sequential
> > readers are running in other group and that is a weak isolation between groups.
> 
> And yet that is what you appear to mean.
> 
> But surely nobody would do that - the 100ms would be assigned to and
> distributed amongst all readers in group1?

Dividing the 100ms among all the sequential readers might not be very good on
rotational media, as each reader runs for a small time and then a seek
happens. This will increase the number of seeks in the system. Think of 32
sequential readers in the group, each getting less than 3ms to run.

A better way probably is to give each queue 100ms in one run of the group and
then switch groups. Something like the following:

SR1 RR SR2 RR SR3 RR SR4 RR...

Now each sequential reader gets 100ms and the disk is not seek bound; at the
same time, random reader latency is limited by the number of competing groups
and not by the number of processes in the group. This is what the IO scheduler
based IO controller is effectively doing currently.

> 
> > When we control things at IO scheduler level, we assign one time slice to one
> > group and then pick next entity to run. So effectively after one time slice
> > (max 180ms, if prio 0 sequential reader is running), random reader in other
> > group will get to run. Hence we achieve better isolation between groups, as
> > the response time of a process in a different group is generally not
> > dependent on the number of processes running in the competing group.
> 
> I don't understand why you're comparing this implementation with such
> an obviously dumb competing design!
> 
> > So a higher level solution is most likely limited to only shaping bandwidth
> > without any control on latencies.
> > 
> > Stacking group scheduler on top of CFQ can lead to issues
> > ---------------------------------------------------------
> > IO throttling and dm-ioband are both second level controllers. That is, these
> > controllers are implemented in higher layers than the io schedulers. So they
> > control the IO at higher layer based on group policies and later IO
> > schedulers take care of dispatching these bios to disk.
> > 
> > Implementing a second level controller has the advantage of being able to
> > provide bandwidth control even on logical block devices in the IO stack
> > which don't have any IO scheduler attached. But they can also
> > interfere with the IO scheduling policy of the underlying IO scheduler and change
> > the effective behavior. Following are some of the issues which I think
> > should be visible in second level controller in one form or other.
> > 
> >   Prio with-in group
> >   ------------------
> >   A second level controller can potentially interfere with the behavior of
> >   different prio processes within a group. bios are buffered at a higher layer
> >   in a single queue, and release of bios is FIFO, not proportionate to the
> >   ioprio of the process. This can result in a particular prio level not
> >   getting its fair share.
> 
> That's an administrator error, isn't it?  Should have put the
> different-priority processes into different groups.
> 

I am thinking that in practice it probably will be a mix of priorities in each
group. For example, consider a hypothetical scenario where two students
on a university server are given two cgroups of certain weights, so that IO
done by these students is limited in case of contention. These students
might want to throw a mix of priority workloads into their respective cgroups.
The admin would have no idea what priority of processes the students are
running in their respective cgroups.

> >   Buffering at higher layer can delay read requests for more than slice idle
> >   period of CFQ (default 8 ms). That means, it is possible that we are waiting
> >   for a request from the queue but it is buffered at higher layer and then idle
> >   timer will fire. It means that the queue will lose its share; at the same
> >   time, overall throughput will be impacted as we lost those 8 ms.
> 
> That sounds like a bug.
> 

Actually this probably is a limitation of a higher level controller. It most
likely sits so high in the IO stack that it has no idea what the underlying
IO scheduler is or what its policies are, so it can't keep up with them.
Secondly, it might be a low weight group, and tokens might not be available
fast enough to release the request.

> >   Read Vs Write
> >   -------------
> >   Writes can overwhelm readers, hence a second level controller's FIFO release
> >   will run into issues here. If a single queue is maintained, then reads
> >   will suffer large latencies. If there are separate queues for reads and writes,
> >   then it will be hard to decide in what ratio to dispatch reads and writes, as
> >   it is IO scheduler's decision to decide when and how much read/write to
> >   dispatch. This is another place where higher level controller will not be in
> >   sync with lower level io scheduler and can change the effective policies of
> >   underlying io scheduler.
> 
> The IO schedulers already take care of read-vs-write and already take
> care of preventing large writes-starve-reads latencies (or at least,
> they're supposed to).

True. Actually this is a limitation of a higher level controller. A higher
level controller will most likely implement some kind of queuing/buffering
mechanism where it buffers requests when it decides to throttle the
group. Now, once a fair number of read and write requests are buffered, and
the controller is ready to dispatch some requests from the group, which
requests/bios should it dispatch? Reads first, writes first, or reads and
writes in a certain ratio?

The ratio in which reads and writes are dispatched is the property and
decision of the IO scheduler. Now the higher level controller will be taking
this decision, changing the behavior of the underlying io scheduler.

> 
> >   CFQ IO context Issues
> >   ---------------------
> >   Buffering at higher layer means submission of bios later with the help of
> >   a worker thread.
> 
> Why?
> 
> If it's a read, we just block the userspace process.
> 
> If it's a delayed write, the IO submission already happens in a kernel thread.

Is it ok to block pdflush on a group? Some low weight group might block it
for a long time and hence not allow flushing out other pages. Probably that's
the reason pdflush used to check whether the underlying device is congested,
and if it is, we don't go ahead with submission of the request.
With per-bdi flusher threads, things will change.

I think btrfs also has some threads which don't want to block; if the
underlying device is congested, they bail out. That's the reason I
implemented a per group congestion interface: a thread that does not want
to block can check whether the group its IO is going into is congested,
i.e. whether it would block. So for such threads, a higher level
controller would probably have to implement a per group congestion interface,
so that threads which don't want to block can check with the controller
whether it has sufficient BW to let them through without blocking, or maybe
start buffering the writes in the group queue.

> 
> If it's a synchronous write, we have to block the userspace caller
> anyway.
> 
> Async reads might be an issue, dunno.
> 

I think async IO is one of the reasons. IIRC, Andrea Righi implemented the
policy of returning an error for async IO if the group did not have sufficient
tokens to dispatch it, expecting the application to retry
later. I am not sure if that is ok.

So yes, if we are not buffering any of the read requests, and are either
blocking the caller or returning an error (async IO), then the CFQ io context
is not an issue.

> > This changes the io context information at CFQ layer which
> >   assigns the request to submitting thread. Change of io context info again
> >   leads to issues of idle timer expiry and issue of a process not getting fair
> >   share and reduced throughput.
> 
> But we already have that problem with delayed writeback, which is a
> huge thing - often it's the majority of IO.
> 

For delayed writes CFQ will not anticipate, so increased anticipation timer
expiry is not an issue with writes. But it probably will be an issue with
reads, where the higher level controller decides to block the next read while
CFQ is anticipating on that read. I wonder whether such issues
must also appear with all the higher level device mapper/software raid
devices. How do they handle it? Maybe it is more theoretical, and in practice
the impact is not significant.

> >   Throughput with noop, deadline and AS
> >   ---------------------------------------------
> >   I think a higher level controller will result in reduced overall throughput
> >   (as compared to io scheduler based io controller) and more seeks with noop,
> >   deadline and AS.
> > 
> >   The reason being, that it is likely that IO with-in a group will be related
> >   and will be relatively close as compared to IO across the groups. For example,
> >   thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
> >   control, IO from various groups will go into a single queue at lower level
> >   controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
> >   G4....) causing more seeks and reduced throughput. (Agreed that merging will
> >   help up to some extent but still....).
> > 
> >   Instead, in the case of a lower level controller, the IO scheduler maintains
> >   one queue per group, hence there is no interleaving of IO between groups. And
> >   if IO is related within a group, then we should see fewer/shorter seeks and
> >   higher throughput.
> > 
> >   Latency can be a concern but that can be controlled by reducing the time
> >   slice length of the queue.
> 
> Well maybe, maybe not.  If a group is throttled, it isn't submitting
> new IO.  The unthrottled group is doing the IO submitting and that IO
> will have decent locality.

But throttling will kick in only occasionally. The rest of the time both
groups will be dispatching bios at the same time. So for the most part the IO
scheduler will probably see IO from both groups, with only small intervals
where one group is completely throttled and the IO scheduler is busy
dispatching requests from a single group.

> 
> > Fairness at logical device level vs at physical device level
> > ------------------------------------------------------------
> > 
> > IO scheduler based controller has the limitation that it works only with the
> > bottom most devices in the IO stack where IO scheduler is attached.
> > 
> > For example, assume a user has created a logical device lv0 using three
> > underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> > in two groups doing IO on lv0. Also assume that weights of groups are in the
> > ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
> > 
> > 			     T1    T2
> > 			       \   /
> > 			        lv0
> > 			      /  |  \
> > 			    sda sdb  sdc
> > 
> > 
> > Now resource control will take place only on devices sda, sdb and sdc and
> > not at lv0 level. So if IO from two tasks is relatively uniformly
> > distributed across the disks then T1 and T2 will see the throughput ratio
> > in proportion to weight specified. But if IO from T1 and T2 is going to
> > different disks and there is no contention then at higher level they both
> > will see same BW.
> > 
> > Here a second level controller can produce better fairness numbers at the
> > logical device, but most likely at reduced overall throughput of the system,
> > because it will try to control IO even if there is no contention at the
> > physical level, possibly leaving disks unused in the system.
> > 
> > Hence, the question arises of how important it is to also control bandwidth at
> > higher level logical devices. The actual contention for resources is
> > at the leaf block device, so it probably makes sense to do any kind of
> > control there and not at the intermediate devices. Secondly, it probably
> > also means better use of the available resources.
> 
> hm.  What will be the effects of this limitation in real-world use?

In some cases the user/application will not see the bandwidth ratio between
two groups in the same proportion as the assigned weights, and the primary
reason for that will be that the workload did not create enough contention for
the physical resources underneath.

So it all depends on what kind of bandwidth guarantees we are offering. If
we say that we provide good fairness numbers at logical devices even when
that means resources are not used optimally, then it will be
irritating for the user.

I think it also might become an issue once we implement max bandwidth
control. We will not be able to define max bandwidth on a logical device
and an application will get more than max bandwidth if it is doing IO to
different underlying devices.

I would say that leaf node control is good for optimal resource usage and
for proportional BW control, but not a good fit for max bandwidth control.

> 
> > Limited Fairness
> > ----------------
> > Currently CFQ idles on a sequential reader queue to make sure it gets its
> > fair share. A second level controller will find it tricky to anticipate.
> > Either it will not have any anticipation logic and in that case it will not
> > provide fairness to single readers in a group (as dm-ioband does) or if it
> > starts anticipating then we should run into these strange situations where
> > second level controller is anticipating on one queue/group and underlying
> > IO scheduler might be anticipating on something else.
> 
> It depends on the size of the inter-group timeslices.  If the amount of
> time for which a group is unthrottled is "large" compared to the
> typical anticipation times, this issue fades away.
> 
> And those timeslices _should_ be large.  Because as you mentioned
> above, different groups are probably working different parts of the
> disk.
> 
> > Need of device mapper tools
> > ---------------------------
> > A device mapper based solution will require the creation of an ioband device
> > on each physical/logical device one wants to control. So it requires usage
> > of device mapper tools even for the people who are not using device mapper.
> > At the same time, creating an ioband device on each partition in the system to
> > control the IO can be cumbersome and overwhelming if the system has lots of
> > disks and partitions.
> > 
> > 
> > IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> > problem of group bandwidth control, and can do hierarchical IO scheduling
> > more tightly and efficiently.
> > 
> > But I am all ears to alternative approaches and suggestions how doing things
> > can be done better and will be glad to implement it.
> > 
> > TODO
> > ====
> > - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > - More testing to make sure there are no regressions in CFQ.
> > 
> > Testing
> > =======
> > 
> > Environment
> > ==========
> > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
> 
> That's a bit of a toy.

Yes it is. :-)

> 
> Do we have testing results for more enterprisey hardware?  Big storage
> arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)

Not yet. I will try to get hold of some storage arrays and run some tests.

> 
> 
> > I am mostly
> > running fio jobs which have been limited to 30 second runs, and then monitored
> > the throughput and latency.
> >  
> > Test1: Random Reader Vs Random Writers
> > ======================================
> > Launched a random reader and then an increasing number of random writers to
> > see the effect on random reader BW and max latencies.
> > 
> > [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> > 
> > [Vanilla CFQ, No groups]
> > <--------------random writers-------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec   
> > 2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec   
> > 4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec   
> > 8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec   
> > 16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec   
> > 32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec   
> > 
> > Created two cgroups group1 and group2 of weights 500 each.  Launched increasing
> > number of random writers in group1 and one random reader in group2 using fio.
> > 
> > [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> > <--------------random writers(group1)-------------> <-random reader(group2)->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec   
> > 2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec   
> > 4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec   
> > 8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec   
> > 16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec   
> > 32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec   
> 
> That's a good result.
> 
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> > 
> > [IO controller CFQ; No groups ]
> > <--------------random writers-------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec   
> > 2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec   
> > 4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec   
> > 8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec   
> > 16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec   
> > 32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec   
> > 
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader, bringing
> >   down its throughput and bumping up latencies significantly.
> 
> Isn't that a CFQ shortcoming which we should address separately?  If
> so, the comparisons aren't presently valid because we're comparing with
> a CFQ which has known, should-be-fixed problems.

I am not sure if it is a CFQ issue. These are synchronous random writes, and
are just as important as the random reader. So now CFQ has 33 synchronous
queues to serve. Because it does not know about groups, it has no choice but
to serve them in a round robin manner. So it does not sound like a CFQ issue.
CFQ could give the random reader an advantage if it knew that the random
reader is in a different group, and that's where the IO controller comes into
the picture.

> 
> > - With IO controller, one can provide isolation to the random reader group and
> >   maintain a consistent view of bandwidth and latencies.
> > 
> > Test2: Random Reader Vs Sequential Reader
> > ========================================
> > Launched a random reader and then an increasing number of sequential readers
> > to see the effect on the BW and latencies of the random reader.
> > 
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> > 
> > [ Vanilla CFQ, No groups ]
> > <---------------seq readers---------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec   
> > 2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec   
> > 4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec   
> > 8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec   
> > 16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec  
> > 
> > Created two cgroups group1 and group2 of weights 500 each.  Launched increasing
> > number of sequential readers in group1 and one random reader in group2 using
> > fio.
> > 
> > [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> > <---------------group1--------------------------->  <------group2--------->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec   
> > 2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec   
> > 4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec   
> > 8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec   
> > 16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec   
> > 
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> > 
> > [IO controller CFQ; No groups ]
> > <---------------seq readers---------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec   
> > 2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec   
> > 4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec   
> > 8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec   
> > 16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec  
> > 
> > Notes:
> > - The BW and latencies of the random reader in group2 seem to be stable and
> >   bounded, and do not get impacted much as the number of sequential readers
> >   increases in group1. Hence providing good isolation.
> > 
> > - Throughput of sequential readers comes down and latencies go up as half
> >   of disk bandwidth (in terms of time) has been reserved for random reader
> >   group.
> > 
> > Test3: Sequential Reader Vs Sequential Reader
> > ============================================
> > Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> > Launched increasing number of sequential readers in group1 and one sequential
> > reader in group2 using fio and monitored how bandwidth is being distributed
> > between two groups.
> > 
> > First 5 columns give stats about job in group1 and last two columns give
> > stats about job in group2.
> > 
> > <---------------group1--------------------------->  <------group2--------->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec   
> > 2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec   
> > 4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec   
> > 8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec   
> > 16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec   
> > 
> > Note: group2 is getting double the bandwidth of group1 even in the face
> > of increasing number of readers in group1.
> > 
> > Test4 (Isolation between two KVM virtual machines)
> > ==================================================
> > Created two KVM virtual machines. Partitioned a disk on the host into two
> > partitions and gave one partition to each virtual machine. Put the two
> > virtual machines in two different cgroups of weight 1000 and 500. The
> > virtual machines created ext3 file systems on the partitions exported from
> > the host and did buffered writes. The host sees the writes as synchronous,
> > and the virtual machine with the higher weight gets double the disk time
> > of the virtual machine with the lower weight. Used the deadline scheduler
> > in this test case.
> > 
> > Some more details about configuration are in documentation patch.
> > 
> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky, and the biggest reason is that async
> > writes are cached in higher layers (page cache), and possibly in the file
> > system layer as well (btrfs, xfs etc.), and are dispatched to lower layers
> > not necessarily in a proportional manner.
> > 
> > For example, consider two dd threads reading /dev/zero as the input file
> > and writing huge files. Very soon we will cross vm_dirty_ratio, and a dd
> > thread will be forced to write out some pages to disk before more pages
> > can be dirtied. But the dirty pages picked are not necessarily those of
> > the same thread. The writeout can very well pick the inode of the
> > lower-priority dd thread. So effectively the higher-weight dd is doing
> > writeouts of the lower-weight dd's pages, and we don't see service
> > differentiation.
> > 
> > IOW, the core problem with buffered write fairness is that the
> > higher-weight thread does not throw enough IO traffic at the IO controller
> > to keep the queue continuously backlogged. In my testing, there are many
> > 0.2 to 0.8 second intervals where the higher-weight queue is empty, and in
> > that duration the lower-weight queue gets lots of work done, giving the
> > impression that there was no service differentiation.
> > 
> > In summary, from the IO controller's point of view, async write support
> > is there. But because the page cache has not been designed so that a
> > higher prio/weight writer can do more writeout than a lower prio/weight
> > writer, getting service differentiation is hard, and it is visible in
> > some cases and not in others.
> 
> Here's where it all falls to pieces.
> 
> For async writeback we just don't care about IO priorities.  Because
> from the point of view of the userspace task, the write was async!  It
> occurred at memory bandwidth speed.
> 
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation.  And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
> 
> So when balance_dirty_pages() hits, what do we want to do?
> 
> I suppose that all we can do is to block low-ioprio processes more
> agressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
> 
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.

True, that's an issue. For async writes we don't create parallel IO paths
from user space to the IO scheduler, hence it is hard to provide fairness
in all cases. I think part of the problem is the page cache, and some
serialization also comes from kjournald.

How about coming up with another cgroup controller for buffered writes, or
clubbing it with the memory controller as KAMEZAWA Hiroyuki suggested, and
co-mounting it with the io controller? That should help control buffered
writes per cgroup.

> 
> Importantly screwed!  It's a very common workload pattern, and one
> which causes tremendous amounts of IO to be generated very quickly,
> traditionally causing bad latency effects all over the place.  And we
> have no answer to this.
> 
> > Vanilla CFQ Vs IO Controller CFQ
> > ================================
> > We have not fundamentally changed CFQ; instead we enhanced it to also
> > support hierarchical io scheduling. In the process there are invariably
> > small changes here and there as new scenarios come up. Running some tests
> > here and comparing both CFQs to see if there is any major deviation in
> > behavior.
> > 
> > Test1: Sequential Readers
> > =========================
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
> > 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
> > 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
> > 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
> > 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  
> > 
> > IO scheduler: IO controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
> > 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
> > 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
> > 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
> > 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  
> > 
> > Test2: Sequential Writers
> > =========================
> > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
> > 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
> > 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
> > 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
> > 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
> > 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
> > 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
> > 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
> > 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  
> > 
> > Test3: Random Readers
> > =========================
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
> > 2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
> > 4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
> > 8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
> > 16  38KiB/s     8KiB/s      328KiB/s    3965 msec   
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
> > 2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
> > 4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
> > 8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
> > 16  43KiB/s     9KiB/s      327KiB/s    3905 msec   
> > 
> > Test4: Random Writers
> > =====================
> > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
> > 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
> > 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
> > 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
> > 16  66KiB/s     22KiB/s     829KiB/s    1308 msec   
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
> > 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
> > 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
> > 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
> > 16  71KiB/s     29KiB/s     814KiB/s    1457 msec   
> > 
> > Notes:
> >  - Does not look like anything has changed significantly.
> > 
> > Previous versions of the patches were posted here.
> > ------------------------------------------------
> > 
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > (V6) http://lkml.org/lkml/2009/7/2/369
> > (V7) http://lkml.org/lkml/2009/7/24/253
> > (V8) http://lkml.org/lkml/2009/8/16/204
> > (V9) http://lkml.org/lkml/2009/8/28/327
> > 
> > Thanks
> > Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-24 21:33   ` Andrew Morton
@ 2009-09-25  5:04     ` Vivek Goyal
  -1 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25  5:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, peterz, jmarchan, torvalds, mingo, riel

On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:
> On Thu, 24 Sep 2009 15:25:04 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > 
> > Hi All,
> > 
> > Here is the V10 of the IO controller patches generated on top of 2.6.31.
> > 
> 
> Thanks for the writeup.  It really helps and is most worthwhile for a
> project of this importance, size and complexity.
> 
> 
> >  
> > What problem are we trying to solve
> > ===================================
> > Provide group IO scheduling feature in Linux along the lines of other resource
> > controllers like cpu.
> > 
> > IOW, provide facility so that a user can group applications using cgroups and
> > control the amount of disk time/bandwidth received by a group based on its
> > weight. 
> > 
> > How to solve the problem
> > =========================
> > 
> > Different people have solved the issue differently. So far it looks like
> > we have the following two core requirements when it comes to fairness at
> > the group level.
> > 
> > - Control bandwidth seen by groups.
> > - Control on latencies when a request gets backlogged in group.
> > 
> > At least there are now three patchsets available (including this one).
> > 
> > IO throttling
> > -------------
> > This is a bandwidth controller which keeps track of IO rate of a group and
> > throttles the process in the group if it exceeds the user specified limit.
> > 
> > dm-ioband
> > ---------
> > This is a proportional bandwidth controller implemented as device mapper
> > driver and provides fair access in terms of amount of IO done (not in terms
> > of disk time as CFQ does).
> > 
> > So one will set up one or more dm-ioband devices on top of a
> > physical/logical block device, configure the ioband device, and pass in
> > information like grouping etc. This device will then keep track of bios
> > flowing through it and control the flow of bios based on group policies.
> > 
> > IO scheduler based IO controller
> > --------------------------------
> > Here we have viewed the problem of the IO controller as a hierarchical
> > group scheduling issue (along the lines of CFS group scheduling).
> > Currently one can view Linux IO schedulers as flat, where there is one
> > root group and all the IO belongs to that group.
> > 
> > This patchset basically modifies IO schedulers to also support
> > hierarchical group scheduling. CFQ already provides fairness among
> > different processes; I have extended it to support group IO scheduling.
> > I also took some of the code out of CFQ and put it in a common layer so
> > that the same group scheduling code can be used by noop, deadline and AS
> > to support group scheduling.
> > 
> > Pros/Cons
> > =========
> > There are pros and cons to each of the approach. Following are some of the
> > thoughts.
> > 
> > Max bandwidth vs proportional bandwidth
> > ---------------------------------------
> > IO throttling is a max bandwidth controller and not a proportional one.
> > Additionally, it provides fairness in terms of the amount of IO done (and
> > not in terms of disk time, as CFQ does).
> > 
> > Personally, I think that a proportional weight controller is useful to
> > more people than a pure max bandwidth controller. In addition, the IO
> > scheduler based controller can also be enhanced to do max bandwidth
> > control, so it can satisfy a wider set of requirements.
> > 
> > Fairness in terms of disk time vs size of IO
> > ---------------------------------------------
> > A higher-level controller will most likely be limited to providing
> > fairness in terms of the size/number of IOs done, and will find it hard
> > to provide fairness in terms of disk time used (as CFQ provides between
> > various prio levels). This is because only the IO scheduler knows how
> > much disk time a queue has used, and information about queues and disk
> > time used is not exported to higher layers.
> > 
> > So a seeky application will still run away with a lot of disk time and
> > bring down the overall throughput of the disk.
> 
> But that's only true if the thing is poorly implemented.
> 
> A high-level controller will need some view of the busyness of the
> underlying device(s).  That could be "proportion of idle time", or
> "average length of queue" or "average request latency" or some mix of
> these or something else altogether.
> 
> But these things are simple to calculate, and are simple to feed back
> to the higher-level controller, and probably don't require any changes
> to the IO scheduler at all, which is a great advantage.
> 
> 
> And I must say that high-level throttling based upon feedback from
> lower layers seems like a much better model to me than hacking away in
> the IO scheduler layer.  Both from an implementation point of view and
> from a "we can get it to work on things other than block devices" point
> of view.
> 

Hi Andrew,

Few thoughts.

- A higher level throttling approach suffers from the issue of unfair
  throttling. If there are multiple tasks in the group, whom do we
  throttle, and how do we make sure that we throttle in proportion to the
  prio of the tasks? Andrea's IO throttling implementation suffered from
  these issues. I had run some tests where RT and BE tasks were getting
  the same BW within a group, or tasks of different prio were getting the
  same BW.

  Even if we figure out a way to do fair throttling within a group, the
  underlying IO scheduler might not be CFQ at all, in which case we should
  not have done it.

https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html

- Higher level throttling does not know where the IO is actually going in
  the physical layer. So we might unnecessarily throttle IOs which go to
  the same logical device but, at the end of the day, to different
  physical devices.

  Agreed that some people will want that behavior, especially in the case
  of max bandwidth control, where the administrator does not want to give
  you the BW because you did not pay for it.

  So a higher level controller is good for max BW control, but when it
  comes to optimal usage of resources, controlling only when needed, it
  probably is not the best thing.

About the feedback thing, I am not very sure. Are you saying that we will
run timed groups in a higher layer and take feedback from the underlying
IO scheduler about how much time a group consumed, or something like that,
and not do accounting in terms of the size of IO?

 
> > Currently dm-ioband provides fairness in terms of number/size of IO.
> > 
> > Latencies and isolation between groups
> > --------------------------------------
> > A higher level controller generally implements a bandwidth throttling
> > solution: if a group exceeds either the max bandwidth or its proportional
> > share, then throttle that group.
> > 
> > This kind of approach will probably not help in controlling latencies,
> > as that will depend on the underlying IO scheduler. Consider the
> > following scenario.
> > 
> > Assume there are two groups. One group is running multiple sequential
> > readers and the other group has a random reader. The sequential readers
> > will get a nice 100ms slice
> 
> Do you refer to each reader within group1, or to all readers?  It would be
> daft if each reader in group1 were to get 100ms.
> 

All readers in the group should get 100ms each, both in the IO throttling
and the dm-ioband solutions.

Higher level solutions do not keep track of time slices. Time slices will
be allocated by CFQ, which has no idea about grouping. The higher level
controller just keeps track of the size of IO done at the group level and
then runs either a leaky bucket or a token bucket algorithm.
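
The token-bucket accounting mentioned here can be sketched as follows (an
illustrative toy model in Python, not actual dm-ioband or io-throttle code;
all names and numbers are invented):

```python
# Toy token-bucket rate limiter, accounting in bytes of IO (not disk time).
# Illustrative sketch only, not code from any of the controllers discussed.

class TokenBucket:
    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec   # tokens added per second
        self.capacity = burst_bytes      # maximum accumulated tokens
        self.tokens = burst_bytes
        self.last = 0.0                  # timestamp of last refill

    def refill(self, now):
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_dispatch(self, now, io_size):
        """Return True if an IO of io_size bytes may be dispatched now."""
        self.refill(now)
        if self.tokens >= io_size:
            self.tokens -= io_size
            return True
        return False    # caller must buffer or block the bio

bucket = TokenBucket(rate_bytes_per_sec=1024 * 1024, burst_bytes=256 * 1024)
assert bucket.try_dispatch(0.0, 128 * 1024)       # within the burst
assert not bucket.try_dispatch(0.0, 256 * 1024)   # bucket nearly empty
assert bucket.try_dispatch(1.0, 256 * 1024)       # refilled after 1 second
```

Note the model's blind spot: it only sees bytes, so a seeky group and a
sequential group spending very different amounts of disk time per byte look
identical to it.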

IO throttling is a max BW controller, so it will not even care about what
is happening in the other group. It will just be concerned with the rate of
IO in one particular group, and if we exceed the specified limit, throttle
it. So unless the sequential reader group hits its max BW limit, it will
keep sending reads down to CFQ, and CFQ will happily assign 100ms slices to
the readers.

dm-ioband will not try to choke the high throughput sequential reader group
for the slow random reader group, because that would just kill the
throughput of rotational media: every sequential reader would run for a few
ms and then be throttled, and so on, and the disk would soon be seek bound.

> > each, and then the random reader from group2 will get to dispatch a
> > request. So the latency of this random reader will depend on how many
> > sequential readers are running in the other group, and that is weak
> > isolation between groups.
> 
> And yet that is what you appear to mean.
> 
> But surely nobody would do that - the 100ms would be assigned to and
> distributed amongst all readers in group1?

Dividing the 100ms among all the sequential readers might not be very good
on rotational media, as each reader runs for a small time and then a seek
happens. This will increase the number of seeks in the system. Think of 32
sequential readers in the group, each getting less than 3ms to run.

A better way probably is to give each queue 100ms in one run of the group
and then switch groups. Something like the following.

SR1 RR SR2 RR SR3 RR SR4 RR...

Now each sequential reader gets 100ms and the disk is not seek bound; at
the same time, random reader latency is limited by the number of competing
groups and not by the number of processes in the group. This is what the
IO scheduler based IO controller is effectively doing currently.
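
This dispatch pattern can be illustrated with a tiny simulation (purely a
sketch; the round-robin policy and all names are simplified assumptions,
not the actual elevator code):

```python
# Illustrative simulation of group-level scheduling: on each visit to a
# group, exactly one of its queues gets a full (e.g. 100ms) slice, and the
# queues rotate within the group. Hypothetical sketch, not kernel code.

from collections import deque

def schedule(groups, slices):
    """groups: dict name -> deque of queue names. Returns dispatch order."""
    order = []
    names = list(groups)
    for i in range(slices):
        g = groups[names[i % len(names)]]   # groups alternate (equal weights)
        q = g[0]
        g.rotate(-1)        # next visit serves the next queue in the group
        order.append(q)
    return order

groups = {"group1": deque(["SR1", "SR2", "SR3", "SR4"]),
          "group2": deque(["RR"])}
print(" ".join(schedule(groups, 8)))
# -> SR1 RR SR2 RR SR3 RR SR4 RR
```

The random reader RR waits for at most one competing slice, no matter how
many sequential readers group1 holds, which is the isolation property
argued for above.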

> 
> > When we control things at the IO scheduler level, we assign one time
> > slice to one group and then pick the next entity to run. So effectively,
> > after one time slice (max 180ms, if a prio 0 sequential reader is
> > running), the random reader in the other group will get to run. Hence we
> > achieve better isolation between groups, as the response time of a
> > process in a different group is generally not dependent on the number of
> > processes running in a competing group.
> 
> I don't understand why you're comparing this implementation with such
> an obviously dumb competing design!
> 
> > So a higher level solution is most likely limited to only shaping bandwidth
> > without any control on latencies.
> > 
> > Stacking group scheduler on top of CFQ can lead to issues
> > ---------------------------------------------------------
> > IO throttling and dm-ioband are both second level controllers. That is,
> > these controllers are implemented in layers higher than the IO
> > schedulers. So they control the IO at a higher layer based on group
> > policies, and the IO schedulers later take care of dispatching these
> > bios to disk.
> > 
> > Implementing a second level controller has the advantage of being able
> > to provide bandwidth control even on logical block devices in the IO
> > stack which don't have any IO scheduler attached. But it can also
> > interfere with the IO scheduling policy of the underlying IO scheduler
> > and change the effective behavior. Following are some of the issues
> > which I think should be visible in a second level controller in one form
> > or another.
> > 
> >   Prio with-in group
> >   ------------------
> >   A second level controller can potentially interfere with the behavior
> >   of different prio processes within a group. bios are buffered at a
> >   higher layer in a single queue, and the release of bios is FIFO and
> >   not proportionate to the ioprio of the process. This can result in a
> >   particular prio level not getting its fair share.
> 
> That's an administrator error, isn't it?  Should have put the
> different-priority processes into different groups.
> 

I am thinking that in practice it probably will be a mix of priorities in
each group. For example, consider a hypothetical scenario where two
students on a university server are given two cgroups of certain weights so
that IO done by these students is limited in case of contention. Now these
students might want to throw a mix of priority workloads into their
respective cgroups. The admin would have no idea what priority processes
the students are running in their respective cgroups.

> >   Buffering at a higher layer can delay read requests for more than the
> >   slice idle period of CFQ (default 8 ms). That means it is possible
> >   that we are waiting for a request from the queue but it is buffered at
> >   a higher layer, and then the idle timer will fire. It means that the
> >   queue will lose its share, and at the same time overall throughput
> >   will be impacted as we lost those 8 ms.
> 
> That sounds like a bug.
> 

Actually this is probably a limitation of a higher level controller. It
most likely sits so high in the IO stack that it has no idea what the
underlying IO scheduler is and what the IO scheduler's policies are, so it
can't keep up with those policies. Secondly, it might be a low weight
group, and tokens might not become available fast enough to release the
request.

> >   Read Vs Write
> >   -------------
> >   Writes can overwhelm readers, hence a second level controller's FIFO
> >   release will run into issues here. If a single queue is maintained,
> >   then reads will suffer large latencies. If there are separate queues
> >   for reads and writes, then it will be hard to decide in what ratio to
> >   dispatch reads and writes, as it is the IO scheduler's decision when
> >   and how much read/write to dispatch. This is another place where the
> >   higher level controller will not be in sync with the lower level IO
> >   scheduler and can change the effective policies of the underlying IO
> >   scheduler.
> 
> The IO schedulers already take care of read-vs-write and already take
> care of preventing large writes-starve-reads latencies (or at least,
> they're supposed to).

True. Actually this is a limitation of a higher level controller. A higher
level controller will most likely implement some kind of queuing/buffering
mechanism where it buffers requests when it decides to throttle the group.
Now once a fair number of read and write requests are buffered, and the
controller is ready to dispatch some requests from the group, which
requests/bios should it dispatch? Reads first, or writes first, or reads
and writes in a certain ratio?

The ratio in which reads and writes are dispatched is the property and
decision of the IO scheduler. Now the higher level controller will be
taking this decision, changing the behavior of the underlying IO scheduler.
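
As a sketch of the dilemma (hypothetical code, not from any of the
controllers discussed): a second-level controller holding separate read and
write buffers is forced to hard-code a dispatch ratio of its own, which may
or may not match what the IO scheduler below would have chosen:

```python
# A second-level controller buffering bios must pick its own read:write
# dispatch ratio, duplicating a policy decision the IO scheduler already
# makes. The ratio and all names here are invented for illustration.

from collections import deque

def dispatch(reads, writes, read_ratio=3):
    """Drain buffered bios, releasing read_ratio reads per write."""
    out = []
    while reads or writes:
        for _ in range(read_ratio):
            if reads:
                out.append(reads.popleft())
        if writes:
            out.append(writes.popleft())
    return out

reads = deque(f"R{i}" for i in range(4))
writes = deque(f"W{i}" for i in range(4))
print(dispatch(reads, writes))
# -> ['R0', 'R1', 'R2', 'W0', 'R3', 'W1', 'W2', 'W3']
```

Whatever value `read_ratio` takes, it overrides the scheduler's own
read/write balancing, which is the mismatch being pointed out.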

> 
> >   CFQ IO context Issues
> >   ---------------------
> >   Buffering at higher layer means submission of bios later with the help of
> >   a worker thread.
> 
> Why?
> 
> If it's a read, we just block the userspace process.
> 
> If it's a delayed write, the IO submission already happens in a kernel thread.

Is it OK to block pdflush on a group? Some low weight group might block it
for a long time and hence not allow flushing out other pages. That is
probably the reason pdflush used to check whether the underlying device is
congested, and if it is, not go ahead with submission of the request. With
per-bdi flusher threads things will change.

I think btrfs also has some threads which don't want to block, and if the
underlying device is congested, they bail out. That's the reason I
implemented a per group congestion interface, where a thread that does not
want to block can check whether the group its IO is going into is congested
and whether it would block. So for such threads, a higher level controller
would probably have to implement a per group congestion interface as well,
so that threads which don't want to block can check with the controller
whether it has sufficient BW to let them through without blocking, or
maybe start buffering writes in the group queue.
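
A rough sketch of such a per-group congestion interface (all names and
thresholds are hypothetical; the real interface lives in the kernel
patches, not here):

```python
# Illustrative sketch: a flusher-style thread that must not block asks
# whether the target group is congested before submitting. Hypothetical
# names and thresholds only.

class GroupQueue:
    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight   # group's backlog limit
        self.in_flight = 0

    def congested(self):
        return self.in_flight >= self.max_in_flight

    def submit(self, n=1):
        self.in_flight += n

def try_writeout(group, pages):
    """Bail out instead of blocking on a slow (low-weight) group."""
    if group.congested():
        return 0            # skip this group, move on to another inode
    group.submit(pages)
    return pages

g = GroupQueue(max_in_flight=4)
assert try_writeout(g, 4) == 4
assert g.congested()
assert try_writeout(g, 2) == 0   # bail out rather than block
```

The point is that the congestion question must be asked per group, not per
device: the device may be idle while one low-weight group's queue is full.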

> 
> If it's a synchronous write, we have to block the userspace caller
> anyway.
> 
> Async reads might be an issue, dunno.
> 

I think async IO is one of the reasons. IIRC, Andrea Righi implemented the
policy of returning an error for async IO if the group did not have
sufficient tokens to dispatch it, expecting the application to retry
later. I am not sure if that is OK.

So yes, if we are not buffering any of the read requests, and either
blocking the caller or returning an error (async IO), then CFQ io context
is not an issue.

> > This changes the io context information at CFQ layer which
> >   assigns the request to submitting thread. Change of io context info again
> >   leads to issues of idle timer expiry and issue of a process not getting fair
> >   share and reduced throughput.
> 
> But we already have that problem with delayed writeback, which is a
> huge thing - often it's the majority of IO.
> 

For delayed writes CFQ will not anticipate, so increased anticipation timer
expiry is not an issue with writes. But it probably will be an issue with
reads, where the higher level controller decides to block the next read
while CFQ is anticipating on that read. I am wondering whether such issues
must also appear with all the higher level device mapper/software RAID
devices. How do they handle it? Maybe it is more theoretical, and in
practice the impact is not significant.

> >   Throughput with noop, deadline and AS
> >   ---------------------------------------------
> >   I think a higher level controller will result in reduced overall
> >   throughput (as compared to an io scheduler based io controller) and
> >   more seeks with noop, deadline and AS.
> > 
> >   The reason being, that it is likely that IO with-in a group will be related
> >   and will be relatively close as compared to IO across the groups. For example,
> >   thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
> >   control, IO from various groups will go into a single queue at lower level
> >   controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
> >   G4....) causing more seeks and reduced throughput. (Agreed that merging will
> >   help up to some extent but still....).
> > 
> >   Instead, in the case of a lower level controller, the IO scheduler
> >   maintains one queue per group, hence there is no interleaving of IO
> >   between groups. And if IO is related within a group, then we should
> >   get a reduced number/amount of seeks and higher throughput.
> > 
> >   Latency can be a concern but that can be controlled by reducing the time
> >   slice length of the queue.
> 
> Well maybe, maybe not.  If a group is throttled, it isn't submitting
> new IO.  The unthrottled group is doing the IO submitting and that IO
> will have decent locality.

But throttling will kick in only occasionally. The rest of the time both
groups will be dispatching bios at the same time. So for the most part the
IO scheduler will see IO from both groups, and there will be small
intervals where one group is completely throttled and the IO scheduler is
busy dispatching requests only from a single group.

> 
> > Fairness at logical device level vs at physical device level
> > ------------------------------------------------------------
> > 
> > IO scheduler based controller has the limitation that it works only with the
> > bottom most devices in the IO stack where IO scheduler is attached.
> > 
> > For example, assume a user has created a logical device lv0 using three
> > underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> > in two groups doing IO on lv0. Also assume that weights of groups are in the
> > ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
> > 
> > 			     T1    T2
> > 			       \   /
> > 			        lv0
> > 			      /  |  \
> > 			    sda sdb  sdc
> > 
> > 
> > Now resource control will take place only on devices sda, sdb and sdc and
> > not at lv0 level. So if IO from two tasks is relatively uniformly
> > distributed across the disks then T1 and T2 will see the throughput ratio
> > in proportion to weight specified. But if IO from T1 and T2 is going to
> > different disks and there is no contention then at higher level they both
> > will see same BW.
> > 
> > Here a second level controller can produce better fairness numbers at
> > the logical device, but most likely at reduced overall throughput of the
> > system, because it will try to control IO even if there is no contention
> > at the physical level, possibly leaving disks unused in the system.
> > 
> > Hence the question is how important it is to control bandwidth at
> > higher level logical devices as well. The actual contention for
> > resources is at the leaf block device, so it probably makes sense to do
> > any kind of control there and not at the intermediate devices. Secondly,
> > it probably also means better use of available resources.
> 
> hm.  What will be the effects of this limitation in real-world use?

In some cases the user/application will not see the bandwidth ratio between
two groups in the same proportion as the assigned weights, and the primary
reason for that will be that the workload did not create enough contention
for the physical resources underneath.

So it all depends on what kind of bandwidth guarantees we are offering. If
we say that we provide good fairness numbers at logical devices
irrespective of whether resources are used optimally, then it will be
irritating for the user.

I think it also might become an issue once we implement max bandwidth
control. We will not be able to define a max bandwidth on a logical device,
and an application will get more than the max bandwidth if it is doing IO
to different underlying devices.

I would say that leaf node control is good for optimal resource usage and
for proportional BW control, but not a good fit for max bandwidth control.

> 
> > Limited Fairness
> > ----------------
> > Currently CFQ idles on a sequential reader queue to make sure it gets its
> > fair share. A second level controller will find it tricky to anticipate.
> > Either it will not have any anticipation logic and in that case it will not
> > provide fairness to single readers in a group (as dm-ioband does) or if it
> > starts anticipating then we should run into these strange situations where
> > second level controller is anticipating on one queue/group and underlying
> > IO scheduler might be anticipating on something else.
> 
> It depends on the size of the inter-group timeslices.  If the amount of
> time for which a group is unthrottled is "large" compared to the
> typical anticipation times, this issue fades away.
> 
> And those timeslices _should_ be large.  Because as you mentioned
> above, different groups are probably working different parts of the
> disk.
> 
> > Need of device mapper tools
> > ---------------------------
> > A device-mapper-based solution will require creation of an ioband device
> > on each physical/logical device one wants to control, so it requires usage
> > of device mapper tools even for people who are not otherwise using device
> > mapper. At the same time, creating an ioband device on each partition in the
> > system to control IO can be cumbersome and overwhelming if the system has
> > lots of disks and partitions.
> > 
> > 
> > IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> > problem of group bandwidth control, and can do hierarchical IO scheduling
> > more tightly and efficiently.
> > 
> > But I am all ears to alternative approaches and suggestions how doing things
> > can be done better and will be glad to implement it.
> > 
> > TODO
> > ====
> > - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > - More testing to make sure there are no regressions in CFQ.
> > 
> > Testing
> > =======
> > 
> > Environment
> > ==========
> > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
> 
> That's a bit of a toy.

Yes it is. :-)

> 
> Do we have testing results for more enterprisey hardware?  Big storage
> arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)

Not yet. I will try to get hold of some storage arrays and run some tests.

> 
> 
> > I am mostly
> > running fio jobs which have been limited to 30-second runs, and then monitored
> > the throughput and latency.
> >  
> > Test1: Random Reader Vs Random Writers
> > ======================================
> > Launched a random reader and then an increasing number of random writers to
> > see the effect on random reader BW and max latencies.
> > 
> > [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> > 
> > [Vanilla CFQ, No groups]
> > <--------------random writers-------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec   
> > 2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec   
> > 4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec   
> > 8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec   
> > 16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec   
> > 32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec   
> > 
> > Created two cgroups group1 and group2 of weights 500 each.  Launched increasing
> > number of random writers in group1 and one random reader in group2 using fio.
> > 
> > [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> > <--------------random writers(group1)-------------> <-random reader(group2)->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec   
> > 2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec   
> > 4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec   
> > 8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec   
> > 16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec   
> > 32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec   
> 
> That's a good result.
> 
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> > 
> > [IO controller CFQ; No groups ]
> > <--------------random writers-------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec   
> > 2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec   
> > 4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec   
> > 8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec   
> > 16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec   
> > 32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec   
> > 
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader, bringing
> >   down its throughput and bumping up its latencies significantly.
> 
> Isn't that a CFQ shortcoming which we should address separately?  If
> so, the comparisons aren't presently valid because we're comparing with
> a CFQ which has known, should-be-fixed problems.

I am not sure this is a CFQ issue. These are synchronous random writes, which
are just as important as the random reader, so CFQ now has 33 synchronous
queues to serve. Because it does not know about groups, it has no choice but
to serve them in a round-robin manner. So it does not sound like a CFQ issue.
CFQ could give the random reader an advantage only if it knew that the random
reader is in a different group, and that's where the IO controller comes into
the picture.
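The arithmetic behind that can be sketched with a toy slice model
(illustrative only: equal fixed-length slices and a strict two-level weight
split are my simplifying assumptions, not CFQ's actual time accounting):

```python
# Toy model of disk-time share for 1 random reader competing with
# N random writers, all queues synchronous.

def flat_share(n_writers):
    # Flat CFQ: all sync queues are served round-robin, so the
    # reader gets one slice out of every (n_writers + 1).
    return 1.0 / (n_writers + 1)

def grouped_share(reader_weight=500, writer_weight=500):
    # Group-aware scheduling: disk time is first split between the
    # two groups by weight; the reader is alone in its group.
    return reader_weight / float(reader_weight + writer_weight)

print(flat_share(32))    # about 3% of disk time in the flat case
print(grouped_share())   # half the disk time, regardless of writer count
```

With 32 writers the reader's share collapses to 1/33 of the disk time in the
flat case, but stays at half once it sits alone in an equal-weight group,
which matches the direction of the numbers in the tables above.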

> 
> > - With IO controller, one can provide isolation to the random reader group
> >   and maintain a consistent view of bandwidth and latencies. 
> > 
> > Test2: Random Reader Vs Sequential Reader
> > ========================================
> > Launched a random reader and then increasing number of sequential readers to
> > see the effect on BW and latencies of random reader.
> > 
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> > 
> > [ Vanilla CFQ, No groups ]
> > <---------------seq readers---------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec   
> > 2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec   
> > 4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec   
> > 8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec   
> > 16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec  
> > 
> > Created two cgroups group1 and group2 of weights 500 each.  Launched increasing
> > number of sequential readers in group1 and one random reader in group2 using
> > fio.
> > 
> > [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> > <---------------group1--------------------------->  <------group2--------->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec   
> > 2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec   
> > 4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec   
> > 8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec   
> > 16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec   
> > 
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> > 
> > [IO controller CFQ; No groups ]
> > <---------------seq readers---------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec   
> > 2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec   
> > 4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec   
> > 8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec   
> > 16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec  
> > 
> > Notes:
> > - The BW and latencies of the random reader in group2 seem to be stable and
> >   bounded and do not get impacted much as the number of sequential readers
> >   increases in group1, hence providing good isolation.
> > 
> > - Throughput of sequential readers comes down and latencies go up as half
> >   of disk bandwidth (in terms of time) has been reserved for random reader
> >   group.
> > 
> > Test3: Sequential Reader Vs Sequential Reader
> > ============================================
> > Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> > Launched increasing number of sequential readers in group1 and one sequential
> > reader in group2 using fio and monitored how bandwidth is being distributed
> > between two groups.
> > 
> > First 5 columns give stats about job in group1 and last two columns give
> > stats about job in group2.
> > 
> > <---------------group1--------------------------->  <------group2--------->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec   
> > 2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec   
> > 4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec   
> > 8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec   
> > 16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec   
> > 
> > Note: group2 is getting double the bandwidth of group1 even in the face
> > of increasing number of readers in group1.
> > 
> > Test4 (Isolation between two KVM virtual machines)
> > ==================================================
> > Created two KVM virtual machines. Partitioned a disk on host in two partitions
> > and gave one partition to each virtual machine. Put both the virtual machines
> > in two different cgroup of weight 1000 and 500 each. Virtual machines created
> > ext3 file system on the partitions exported from host and did buffered writes.
> > The host sees these writes as synchronous, and the virtual machine with
> > higher weight gets double the disk time of the virtual machine with lower
> > weight. Used the deadline scheduler in this test case.
> > 
> > Some more details about configuration are in documentation patch.
> > 
> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky, and the biggest reason is that async
> > writes are cached in higher layers (page cache) and possibly in the file
> > system layer as well (btrfs, xfs etc.), and are not necessarily dispatched
> > to lower layers in a proportional manner.
> > 
> > For example, consider two dd threads reading /dev/zero as input and writing
> > huge files. Very soon we will cross vm_dirty_ratio, and a dd thread will be
> > forced to write out some pages to disk before more pages can be dirtied. But
> > the dirty pages picked are not necessarily those of the same thread. It can
> > very well pick the inode of the lower-priority dd thread and do some
> > writeout. So effectively the higher-weight dd is doing writeouts of the
> > lower-weight dd's pages and we don't see service differentiation.
> > 
> > IOW, the core problem with buffered write fairness is that the higher-weight
> > thread does not throw enough IO traffic at the IO controller to keep its
> > queue continuously backlogged. In my testing, there are many 0.2 to 0.8
> > second intervals where the higher-weight queue is empty, and in that duration
> > the lower-weight queue gets lots of work done, giving the impression that
> > there was no service differentiation.
> > 
> > In summary, from the IO controller's point of view async write support is
> > there. Because the page cache has not been designed in such a manner that a
> > higher prio/weight writer can do more writeout than a lower prio/weight
> > writer, getting service differentiation is hard; it is visible in some
> > cases and not in others.
> 
> Here's where it all falls to pieces.
> 
> For async writeback we just don't care about IO priorities.  Because
> from the point of view of the userspace task, the write was async!  It
> occurred at memory bandwidth speed.
> 
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation.  And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
> 
> So when balance_dirty_pages() hits, what do we want to do?
> 
> I suppose that all we can do is to block low-ioprio processes more
> agressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
> 
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.

True, that's an issue. For async writes we don't create parallel IO paths
from user space to the IO scheduler, hence it is hard to provide fairness in
all cases. I think part of the problem is the page cache, and some
serialization also comes from kjournald.

How about coming up with another cgroup controller for buffered writes, or
clubbing it with the memory controller as KAMEZAWA Hiroyuki suggested, and
co-mounting it with the IO controller? This should help control buffered
writes per cgroup.

> 
> Importantly screwed!  It's a very common workload pattern, and one
> which causes tremendous amounts of IO to be generated very quickly,
> traditionally causing bad latency effects all over the place.  And we
> have no answer to this.
> 
> > Vanilla CFQ Vs IO Controller CFQ
> > ================================
> > We have not fundamentally changed CFQ, instead enhanced it to also support
> > hierarchical io scheduling. In the process invariably there are small changes
> > here and there as new scenarios come up. Running some tests here and comparing
> > both the CFQ's to see if there is any major deviation in behavior.
> > 
> > Test1: Sequential Readers
> > =========================
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
> > 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
> > 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
> > 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
> > 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  
> > 
> > IO scheduler: IO controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
> > 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
> > 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
> > 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
> > 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  
> > 
> > Test2: Sequential Writers
> > =========================
> > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
> > 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
> > 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
> > 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
> > 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
> > 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
> > 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
> > 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
> > 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  
> > 
> > Test3: Random Readers
> > =========================
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
> > 2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
> > 4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
> > 8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
> > 16  38KiB/s     8KiB/s      328KiB/s    3965 msec   
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
> > 2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
> > 4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
> > 8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
> > 16  43KiB/s     9KiB/s      327KiB/s    3905 msec   
> > 
> > Test4: Random Writers
> > =====================
> > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
> > 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
> > 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
> > 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
> > 16  66KiB/s     22KiB/s     829KiB/s    1308 msec   
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
> > 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
> > 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
> > 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
> > 16  71KiB/s     29KiB/s     814KiB/s    1457 msec   
> > 
> > Notes:
> >  - Does not look like anything has changed significantly.
> > 
> > Previous versions of the patches were posted here.
> > ------------------------------------------------
> > 
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > (V6) http://lkml.org/lkml/2009/7/2/369
> > (V7) http://lkml.org/lkml/2009/7/24/253
> > (V8) http://lkml.org/lkml/2009/8/16/204
> > (V9) http://lkml.org/lkml/2009/8/28/327
> > 
> > Thanks
> > Vivek


* Re: IO scheduler based IO controller V10
@ 2009-09-25  5:04     ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25  5:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, m-ikeda, riel, lizf, fchecconi, containers,
	linux-kernel, s-uchida, righi.andrea, torvalds

On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:
> On Thu, 24 Sep 2009 15:25:04 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
> 
> > 
> > Hi All,
> > 
> > Here is the V10 of the IO controller patches generated on top of 2.6.31.
> > 
> 
> Thanks for the writeup.  It really helps and is most worthwhile for a
> project of this importance, size and complexity.
> 
> 
> >  
> > What problem are we trying to solve
> > ===================================
> > Provide group IO scheduling feature in Linux along the lines of other resource
> > controllers like cpu.
> > 
> > IOW, provide facility so that a user can group applications using cgroups and
> > control the amount of disk time/bandwidth received by a group based on its
> > weight. 
> > 
> > How to solve the problem
> > =========================
> > 
> > Different people have solved the issue differently. So far it looks like
> > we have the following two core requirements when it comes to fairness at
> > group level.
> > 
> > - Control bandwidth seen by groups.
> > - Control on latencies when a request gets backlogged in group.
> > 
> > At least there are now three patchsets available (including this one).
> > 
> > IO throttling
> > -------------
> > This is a bandwidth controller which keeps track of IO rate of a group and
> > throttles the process in the group if it exceeds the user specified limit.
> > 
> > dm-ioband
> > ---------
> > This is a proportional bandwidth controller implemented as device mapper
> > driver and provides fair access in terms of amount of IO done (not in terms
> > of disk time as CFQ does).
> > 
> > So one will set up one or more dm-ioband devices on top of a physical/logical
> > block device, configure the ioband device, and pass information like grouping
> > etc. Now this device will keep track of bios flowing through it and control
> > the flow of bios based on group policies.
> > 
> > IO scheduler based IO controller
> > --------------------------------
> > Here we have viewed the problem of the IO controller as a hierarchical group
> > scheduling issue (along the lines of CFS group scheduling). Currently one can
> > view Linux IO schedulers as flat, where there is one root group and all IO
> > belongs to that group.
> > 
> > This patchset basically modifies IO schedulers to also support hierarchical
> > group scheduling. CFQ already provides fairness among different processes; I
> > have extended it to support group IO scheduling. I also took some of the code
> > out of CFQ and put it in a common layer so that the same group scheduling
> > code can be used by noop, deadline and AS to support group scheduling. 
> > 
> > Pros/Cons
> > =========
> > There are pros and cons to each of the approach. Following are some of the
> > thoughts.
> > 
> > Max bandwidth vs proportional bandwidth
> > ---------------------------------------
> > IO throttling is a max bandwidth controller and not a proportional one.
> > Additionally, it provides fairness in terms of amount of IO done (and not in
> > terms of disk time as CFQ does).
> > 
> > Personally, I think that proportional weight controller is useful to more
> > people than just max bandwidth controller. In addition, IO scheduler based
> > controller can also be enhanced to do max bandwidth control. So it can 
> > satisfy wider set of requirements.
> > 
> > Fairness in terms of disk time vs size of IO
> > ---------------------------------------------
> > A higher-level controller will most likely be limited to providing fairness
> > in terms of size/number of IOs done and will find it hard to provide fairness
> > in terms of disk time used (as CFQ provides between various prio levels).
> > This is because only the IO scheduler knows how much disk time a queue has
> > used, and information about queues and disk time used is not exported to
> > higher layers.
> > 
> > So a seeky application will still run away with a lot of disk time and bring
> > down the overall throughput of the disk.
> 
> But that's only true if the thing is poorly implemented.
> 
> A high-level controller will need some view of the busyness of the
> underlying device(s).  That could be "proportion of idle time", or
> "average length of queue" or "average request latency" or some mix of
> these or something else altogether.
> 
> But these things are simple to calculate, and are simple to feed back
> to the higher-level controller and probably don't require any changes
> to the IO scheduler at all, which is a great advantage.
> 
> 
> And I must say that high-level throttling based upon feedback from
> lower layers seems like a much better model to me than hacking away in
> the IO scheduler layer.  Both from an implementation point of view and
> from a "we can get it to work on things other than block devices" point
> of view.
> 

Hi Andrew,

Few thoughts.

- A higher-level throttling approach suffers from the issue of unfair
  throttling. If there are multiple tasks in the group, whom do we
  throttle, and how do we make sure that we throttle in proportion
  to the prio of the tasks? Andrea's IO throttling implementation suffered
  from these issues. I had run some tests where RT and BE tasks were
  getting the same BW within a group, or tasks of different prio were getting
  the same BW. 

  Even if we figure out a way to do fair throttling within a group, the
  underlying IO scheduler might not be CFQ at all, in which case we should
  not have done so.

https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html

- Higher-level throttling does not know where the IO is actually going in the
  physical layer. So we might unnecessarily be throttling IOs which are
  going to the same logical device but, at the end of the day, to different
  physical devices.

  Agreed that some people will want that behavior, especially in the case
  of max bandwidth control where one does not want to give you the BW
  because you did not pay for it.

  So a higher-level controller is good for max BW control, but when it comes
  to optimal usage of resources, doing control only when needed, it
  probably is not the best choice.

About the feedback thing, I am not very sure. Are you saying that we will
run timed groups in higher layer and take feedback from underlying IO
scheduler about how much time a group consumed or something like that and
not do accounting in terms of size of IO?

 
> > Currently dm-ioband provides fairness in terms of number/size of IO.
> > 
> > Latencies and isolation between groups
> > --------------------------------------
> > An higher level controller is generally implementing a bandwidth throttling
> > solution where if a group exceeds either the max bandwidth or the proportional
> > share then throttle that group.
> > 
> > This kind of approach will probably not help in controlling latencies as it
> > will depend on underlying IO scheduler. Consider following scenario. 
> > 
> > Assume there are two groups. One group is running multiple sequential readers
> > and other group has a random reader. sequential readers will get a nice 100ms
> > slice
> 
> Do you refer to each reader within group1, or to all readers?  It would be
> daft if each reader in group1 were to get 100ms.
> 

All readers in the group should get 100ms each, both in the IO throttling and
dm-ioband solutions.

Higher-level solutions do not keep track of time slices. Time slices will
be allocated by CFQ, which has no idea about grouping. The higher-level
controller just keeps track of the size of IO done at the group level and
then runs either a leaky bucket or a token bucket algorithm.
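As a rough illustration of that last point, here is a minimal token-bucket
sketch of the kind of size-based accounting a higher-level controller runs
per group. This is not dm-ioband's or IO throttling's actual code; the class
name, units and refill scheme are assumptions for illustration. Note the
accounting is entirely in bytes, with no notion of disk time:

```python
class TokenBucket:
    """Per-group throttler: tokens accrue at `rate` bytes/s up to
    `burst`; an IO of `size` bytes is admitted only if enough tokens
    are available, otherwise the caller buffers/throttles the bio."""

    def __init__(self, rate, burst):
        self.rate = rate      # refill rate, bytes per second
        self.burst = burst    # bucket capacity, bytes
        self.tokens = burst   # start with a full bucket
        self.last = 0.0       # timestamp of the last refill

    def allow(self, size, now):
        # Refill proportionally to elapsed time, capped at burst.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False

bucket = TokenBucket(rate=1024 * 1024, burst=256 * 1024)  # 1 MiB/s cap
print(bucket.allow(128 * 1024, now=0.0))  # True: fits within the burst
print(bucket.allow(256 * 1024, now=0.0))  # False: bucket half drained
```

Nothing in such a scheme can see that one admitted IO costs the disk 0.1ms
of sequential transfer while another costs an 8ms seek, which is exactly the
disk-time blindness discussed above.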

IO throttling is a max BW controller, so it will not even care about what is
happening in the other group. It will just be concerned with the rate of IO
in one particular group, and if we exceed the specified limit, throttle it.
So unless the sequential reader group hits its max BW limit, it will keep
sending reads down to CFQ, and CFQ will happily assign 100ms slices to the
readers.

dm-ioband will not try to choke the high-throughput sequential reader group
for the slow random reader group, because that would just kill the throughput
of rotational media: every sequential reader would run for a few ms and then
be throttled, and so on. The disk would soon be seek-bound.

> > each, and then a random reader from group2 will get to dispatch a
> > request. So the latency of this random reader will depend on how many
> > sequential readers are running in the other group, and that is weak
> > isolation between groups.
> 
> And yet that is what you appear to mean.
> 
> But surely nobody would do that - the 100ms would be assigned to and
> distributed amongst all readers in group1?

Dividing 100ms among all the sequential readers might not be very good on
rotational media, as each reader runs for a small time and then a seek
happens. This will increase the number of seeks in the system. Think of 32
sequential readers in the group, each getting barely 3ms to run.

A better way probably is to give each queue 100ms in one run of the group and
then switch groups. Something like the following:

SR1 RR SR2 RR SR3 RR SR4 RR...

Now each sequential reader gets 100ms and the disk is not seek-bound; at the
same time, random reader latency is limited by the number of competing groups
and not by the number of processes in the group. This is what the IO scheduler
based IO controller is effectively doing currently.
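The interleaving above can be sketched as a toy dispatch-order generator
(a simplification I made up for illustration: the real controller does
weighted time accounting, not a fixed rotation):

```python
def interleave(groups):
    """Give each queue one full time slice, switching to the next
    group after every slice. Shorter groups repeat their queues, so
    a lone random reader runs once per group switch."""
    longest = max(len(g) for g in groups)
    order = []
    for i in range(longest):
        for g in groups:
            order.append(g[i % len(g)])
    return order

print(interleave([["SR1", "SR2", "SR3", "SR4"], ["RR"]]))
# -> ['SR1', 'RR', 'SR2', 'RR', 'SR3', 'RR', 'SR4', 'RR']
```

Each sequential reader still gets an undivided slice, while the random
reader's wait is bounded by the number of groups, not by the number of
sequential readers.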

> 
> > When we control things at IO scheduler level, we assign one time slice to one
> > group and then pick next entity to run. So effectively after one time slice
> > (max 180ms, if prio 0 sequential reader is running), random reader in other
> > group will get to run. Hence we achieve better isolation between groups, as
> > the response time of a process in a different group is generally not
> > dependent on the number of processes running in the competing group.  
> 
> I don't understand why you're comparing this implementation with such
> an obviously dumb competing design!
> 
> > So a higher level solution is most likely limited to only shaping bandwidth
> > without any control on latencies.
> > 
> > Stacking group scheduler on top of CFQ can lead to issues
> > ---------------------------------------------------------
> > IO throttling and dm-ioband are both second-level controllers. That is,
> > these controllers are implemented in higher layers than the IO schedulers.
> > So they control the IO at a higher layer based on group policies, and later
> > the IO schedulers take care of dispatching these bios to disk.
> > 
> > Implementing a second-level controller has the advantage of being able to
> > provide bandwidth control even on logical block devices in the IO stack
> > which don't have any IO schedulers attached to them. But it can also
> > interfere with the IO scheduling policy of the underlying IO scheduler and
> > change the effective behavior. Following are some of the issues which I
> > think should be visible in a second-level controller in one form or another.
> > 
> >   Prio with-in group
> >   ------------------
> >   A second-level controller can potentially interfere with the behavior of
> >   different-prio processes within a group. Bios are buffered at a higher
> >   layer in a single queue, and release of bios is FIFO, not proportionate to
> >   the ioprio of the process. This can result in a particular prio level not
> >   getting its fair share.
> 
> That's an administrator error, isn't it?  Should have put the
> different-priority processes into different groups.
> 

I am thinking that in practice it probably will be a mix of priorities in each
group. For example, consider a hypothetical scenario where two students
on a university server are given two cgroups of certain weights so that IO
done by these students is limited in case of contention. Now these students
might want to throw a mix of priority workloads into their respective cgroups.
The admin would have no idea what priority processes the students are running
in their respective cgroups.

> >   Buffering at a higher layer can delay read requests for more than the
> >   slice idle period of CFQ (default 8ms). That means it is possible that we
> >   are waiting for a request from the queue but it is buffered at a higher
> >   layer, and then the idle timer will fire. It means that the queue will
> >   lose its share, and at the same time overall throughput will be impacted
> >   as we lost those 8ms.
> 
> That sounds like a bug.
> 

Actually this probably is a limitation of a higher-level controller. It most
likely sits so high in the IO stack that it has no idea what the underlying
IO scheduler is and what the IO scheduler's policies are, so it can't keep up
with them. Secondly, it might be a low-weight group, and tokens might not
become available fast enough to release the request.

> >   Read Vs Write
> >   -------------
> >   Writes can overwhelm readers, hence a second-level controller's FIFO
> >   release will run into issues here. If a single queue is maintained, then
> >   reads will suffer large latencies. If there are separate queues for reads
> >   and writes, then it will be hard to decide in what ratio to dispatch reads
> >   and writes, as it is the IO scheduler's decision when and how much
> >   read/write to dispatch. This is another place where a higher-level
> >   controller will not be in sync with the lower-level IO scheduler and can
> >   change the effective policies of the underlying IO scheduler.
> 
> The IO schedulers already take care of read-vs-write and already take
> care of preventing large writes-starve-reads latencies (or at least,
> they're supposed to).

True. Actually this is a limitation of a higher level controller. A higher
level controller will most likely implement some kind of queuing/buffering
mechanism where it buffers requests when it decides to throttle the
group. Now once a fair number of read and write requests are buffered, and
the controller is ready to dispatch some requests from the group, which
requests/bios should it dispatch? Reads first, writes first, or reads and
writes in a certain ratio?

The ratio in which reads and writes are dispatched is the property and
decision of the IO scheduler. A higher level controller taking this decision
changes the behavior of the underlying IO scheduler.
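
The FIFO ordering problem can be made concrete with a minimal sketch (the
enum and function are made up for illustration, not part of any patch): once
a burst of writes is buffered ahead of a read, FIFO release hands the writes
to the scheduler first, even though the scheduler below would have preferred
the read.

```c
/* Direction of a buffered bio in a hypothetical per-group FIFO. */
enum io_dir { IO_READ, IO_WRITE };

/* Position (0-based) at which the first read escapes a FIFO release,
 * i.e. how many buffered writes are dispatched ahead of it. */
static int fifo_first_read_pos(const enum io_dir *q, int n)
{
	int i;

	for (i = 0; i < n; i++)
		if (q[i] == IO_READ)
			return i;
	return -1;	/* no read buffered */
}
```

Three writes queued ahead of a read means the read waits behind all three,
which is exactly the read-vs-write policy inversion described above.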

> 
> >   CFQ IO context Issues
> >   ---------------------
> >   Buffering at higher layer means submission of bios later with the help of
> >   a worker thread.
> 
> Why?
> 
> If it's a read, we just block the userspace process.
> 
> If it's a delayed write, the IO submission already happens in a kernel thread.

Is it ok to block pdflush on a group? Some low weight group might block it
for a long time and hence prevent flushing out other pages. That is probably
the reason pdflush used to check whether the underlying device is congested,
and if it is, we don't go ahead with submission of the request.
With per-bdi flusher threads things will change.

I think btrfs also has some threads which don't want to block; if the
underlying device is congested, they bail out. That's the reason I
implemented a per group congestion interface, where a thread that does not
want to block can check whether the group its IO is going into is congested
and whether it would block. So for such threads, a higher level
controller would probably have to implement a per group congestion interface
so that threads which don't want to block can check with the controller
whether it has sufficient BW to let them through without blocking, or maybe
start buffering writes in the group queue.
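
A per group congestion check of the kind described could look roughly like
this. It is a hypothetical sketch modelled on the old bdi congestion pattern;
the struct fields and function names are assumptions for illustration, not
the patch's actual interface:

```c
/* Hypothetical per-group congestion interface: a thread that must not
 * block (pdflush, btrfs workers) checks before submitting and bails out
 * instead of sleeping. */
struct io_group {
	int nr_queued;	/* bios currently buffered for this group */
	int max_queue;	/* depth beyond which the group is congested */
};

static int io_group_congested(const struct io_group *iog)
{
	return iog->nr_queued >= iog->max_queue;
}

/* Returns 1 if the bio was admitted, 0 if the caller should back off. */
static int io_group_try_submit(struct io_group *iog)
{
	if (io_group_congested(iog))
		return 0;
	iog->nr_queued++;
	return 1;
}
```

A flusher-like thread would call `io_group_try_submit()` and, on 0, move on
to another inode rather than sleeping on a low weight group's queue.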

> 
> If it's a synchronous write, we have to block the userspace caller
> anyway.
> 
> Async reads might be an issue, dunno.
> 

I think async IO is one of the reasons. IIRC, Andrea Righi implemented the
policy of returning an error for async IO if the group did not have
sufficient tokens to dispatch it, expecting the application to retry
later. I am not sure if that is ok.

So yes, if we are not buffering any of the read requests, and are either
blocking the caller or returning an error (async IO), then CFQ io context is
not an issue.

> > This changes the io context information at CFQ layer which
> >   assigns the request to submitting thread. Change of io context info again
> >   leads to issues of idle timer expiry and issue of a process not getting fair
> >   share and reduced throughput.
> 
> But we already have that problem with delayed writeback, which is a
> huge thing - often it's the majority of IO.
> 

For delayed writes CFQ will not anticipate, so anticipation timer
expiry is not an issue with writes. But it probably will be an issue with
reads, where the higher level controller decides to block the next read while
CFQ is anticipating on that read. I wonder whether such issues also appear
with all the higher level device mapper/software raid devices, and how they
handle them. Maybe it is more theoretical and in practice the impact is not
significant.

> >   Throughput with noop, deadline and AS
> >   ---------------------------------------------
> >   I think a higher level controller will result in reduced overall
> >   throughput (as compared to an io scheduler based io controller) and more
> >   seeks with noop, deadline and AS.
> > 
> >   The reason being, that it is likely that IO with-in a group will be related
> >   and will be relatively close as compared to IO across the groups. For example,
> >   thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
> >   control, IO from various groups will go into a single queue at lower level
> >   controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
> >   G4....) causing more seeks and reduced throughput. (Agreed that merging will
> >   help up to some extent but still....).
> > 
> >   Instead, in case of a lower level controller, the IO scheduler maintains
> >   one queue per group, hence there is no interleaving of IO between groups.
> >   And if IO is related within a group, then we should get a reduced
> >   number/amount of seeks and higher throughput.
> > 
> >   Latency can be a concern but that can be controlled by reducing the time
> >   slice length of the queue.
> 
> Well maybe, maybe not.  If a group is throttled, it isn't submitting
> new IO.  The unthrottled group is doing the IO submitting and that IO
> will have decent locality.

But throttling will kick in only occasionally. The rest of the time both
groups will be dispatching bios at the same time. So for the most part the IO
scheduler will probably see IO from both groups, with only short intervals
where one group is completely throttled and the IO scheduler
is busy dispatching requests from a single group.
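
The seek cost of interleaving can be shown with a toy calculation. The sector
numbers are invented; this only illustrates the locality argument above, not
measured behavior:

```c
#include <stdlib.h>

/* Sum of head travel (in sectors) for a given dispatch order. */
static long total_seek(const long *sectors, int n)
{
	long dist = 0;
	int i;

	for (i = 1; i < n; i++)
		dist += labs(sectors[i] - sectors[i - 1]);
	return dist;
}
```

With G1's IO clustered near sector 100 and G2's near sector 5000, the
interleaved order G1,G2,G1,G2,G1,G2 travels 24480 sectors against 4920 for
dispatching each group's batch back to back, which is the throughput argument
for keeping per-group queues at the scheduler level.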

> 
> > Fairness at logical device level vs at physical device level
> > ------------------------------------------------------------
> > 
> > IO scheduler based controller has the limitation that it works only with the
> > bottom most devices in the IO stack where IO scheduler is attached.
> > 
> > For example, assume a user has created a logical device lv0 using three
> > underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> > in two groups doing IO on lv0. Also assume that weights of groups are in the
> > ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
> > 
> > 			     T1    T2
> > 			       \   /
> > 			        lv0
> > 			      /  |  \
> > 			    sda sdb  sdc
> > 
> > 
> > Now resource control will take place only on devices sda, sdb and sdc and
> > not at lv0 level. So if IO from two tasks is relatively uniformly
> > distributed across the disks then T1 and T2 will see the throughput ratio
> > in proportion to weight specified. But if IO from T1 and T2 is going to
> > different disks and there is no contention then at higher level they both
> > will see same BW.
> > 
> > Here a second level controller can produce better fairness numbers at the
> > logical device, but most likely at reduced overall throughput of the
> > system, because it will try to control IO even if there is no contention at
> > the physical level, possibly leaving disks unused in the system.
> > 
> > Hence, question comes that how important it is to control bandwidth at
> > higher level logical devices also. The actual contention for resources is
> > at the leaf block device so it probably makes sense to do any kind of
> > control there and not at the intermediate devices. Secondly probably it
> > also means better use of available resources.
> 
> hm.  What will be the effects of this limitation in real-world use?

In some cases the user/application will not see the bandwidth ratio between
two groups in the same proportion as the assigned weights, and the primary
reason will be that the workload did not create enough contention for the
physical resources underneath.

So it all depends on what kind of bandwidth guarantees we are offering. If
we are saying that we provide good fairness numbers at logical devices
irrespective of whether resources are used optimally, then it will be
irritating for the user.

I think it also might become an issue once we implement max bandwidth
control. We will not be able to define max bandwidth on a logical device
and an application will get more than max bandwidth if it is doing IO to
different underlying devices.

I would say that leaf node control is good for optimal resource usage and
for proportional BW control, but not a good fit for max bandwidth control.
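
The max bandwidth leak is simple arithmetic. Assuming a cap enforced
independently at each leaf device (an assumption of this sketch, mirroring
the lv0 example above), IO that spreads evenly over the disks of a striped
logical volume can reach the per-disk cap times the number of disks:

```c
/* Illustrative only: a per-disk cap on lv0's underlying disks lets an
 * application striping across all of them exceed the intended limit at
 * the logical device. */
static long effective_lv_limit(long per_disk_limit_mbs, int nr_disks)
{
	return per_disk_limit_mbs * nr_disks;
}
```

So a 10 MB/s leaf-level cap on the three disks behind lv0 effectively becomes
a 30 MB/s limit at the logical device, which is why leaf node control does
not fit max bandwidth control well.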

> 
> > Limited Fairness
> > ----------------
> > Currently CFQ idles on a sequential reader queue to make sure it gets its
> > fair share. A second level controller will find it tricky to anticipate.
> > Either it will not have any anticipation logic and in that case it will not
> > provide fairness to single readers in a group (as dm-ioband does) or if it
> > starts anticipating then we should run into these strange situations where
> > second level controller is anticipating on one queue/group and underlying
> > IO scheduler might be anticipating on something else.
> 
> It depends on the size of the inter-group timeslices.  If the amount of
> time for which a group is unthrottled is "large" compared to the
> typical anticipation times, this issue fades away.
> 
> And those timeslices _should_ be large.  Because as you mentioned
> above, different groups are probably working different parts of the
> disk.
> 
> > Need of device mapper tools
> > ---------------------------
> > A device mapper based solution will require creation of a ioband device
> > on each physical/logical device one wants to control. So it requires usage
> > of device mapper tools even for the people who are not using device mapper.
> > At the same time, creation of an ioband device on each partition in the
> > system to control the IO can be cumbersome and overwhelming if the system
> > has lots of disks and partitions.
> > 
> > 
> > IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> > problem of group bandwidth control, and can do hierarchical IO scheduling
> > more tightly and efficiently.
> > 
> > But I am all ears to alternative approaches and suggestions how doing things
> > can be done better and will be glad to implement it.
> > 
> > TODO
> > ====
> > - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > - More testing to make sure there are no regressions in CFQ.
> > 
> > Testing
> > =======
> > 
> > Environment
> > ==========
> > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
> 
> That's a bit of a toy.

Yes it is. :-)

> 
> Do we have testing results for more enterprisey hardware?  Big storage
> arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)

Not yet. I will try to get hold of some storage arrays and run some tests.

> 
> 
> > I am mostly
> > running fio jobs which have been limited to 30 second runs, and then I
> > monitored the throughput and latency.
> >  
> > Test1: Random Reader Vs Random Writers
> > ======================================
> > Launched a random reader and then an increasing number of random writers to
> > see the effect on random reader BW and max latencies.
> > 
> > [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> > 
> > [Vanilla CFQ, No groups]
> > <--------------random writers-------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec   
> > 2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec   
> > 4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec   
> > 8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec   
> > 16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec   
> > 32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec   
> > 
> > Created two cgroups group1 and group2 of weights 500 each.  Launched increasing
> > number of random writers in group1 and one random reader in group2 using fio.
> > 
> > [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> > <--------------random writers(group1)-------------> <-random reader(group2)->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec   
> > 2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec   
> > 4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec   
> > 8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec   
> > 16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec   
> > 32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec   
> 
> That's a good result.
> 
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> > 
> > [IO controller CFQ; No groups ]
> > <--------------random writers-------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec   
> > 2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec   
> > 4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec   
> > 8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec   
> > 16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec   
> > 32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec   
> > 
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader, bringing
> >   down its throughput and bumping up latencies significantly.
> 
> Isn't that a CFQ shortcoming which we should address separately?  If
> so, the comparisons aren't presently valid because we're comparing with
> a CFQ which has known, should-be-fixed problems.

I am not sure it is a CFQ issue. These are synchronous random writes,
which are just as important as the random reader. So now CFQ has 33
synchronous queues to serve. Because it does not know about groups, it has no
choice but to serve them in round robin manner. So it does not sound like a
CFQ issue. CFQ can give the random reader an advantage only if it knows the
random reader is in a different group, and that's where the IO controller
comes into the picture.
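
The difference can be put in numbers with a toy model (the 500/500 weights
mirror the cgroups used in the tests; the functions are illustrative, not
CFQ's actual accounting): in a flat round robin each of the 33 sync queues
gets an equal slice, while with two equal weight groups the reader's group
gets half the disk time no matter how many writers sit in the other group.

```c
/* Share of disk time for one queue in a flat round robin of sync queues. */
static double flat_share(int nr_queues)
{
	return 1.0 / nr_queues;
}

/* Share of disk time for a group under proportional weights. */
static double grouped_share(int group_weight, int total_weight)
{
	return (double)group_weight / total_weight;
}
```

One reader among 32 writers gets about 3% of the disk time flat, versus 50%
when isolated in its own equal-weight group, which matches the stable reader
bandwidth seen in the grouped test results.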

> 
> > - With IO controller, one can provide isolation to the random reader group
> >   and maintain a consistent view of bandwidth and latencies.
> > 
> > Test2: Random Reader Vs Sequential Reader
> > ========================================
> > Launched a random reader and then an increasing number of sequential
> > readers to see the effect on the BW and latencies of the random reader.
> > 
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> > 
> > [ Vanilla CFQ, No groups ]
> > <---------------seq readers---------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec   
> > 2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec   
> > 4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec   
> > 8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec   
> > 16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec  
> > 
> > Created two cgroups group1 and group2 of weights 500 each.  Launched increasing
> > number of sequential readers in group1 and one random reader in group2 using
> > fio.
> > 
> > [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> > <---------------group1--------------------------->  <------group2--------->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec   
> > 2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec   
> > 4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec   
> > 8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec   
> > 16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec   
> > 
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> > 
> > [IO controller CFQ; No groups ]
> > <---------------seq readers---------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec   
> > 2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec   
> > 4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec   
> > 8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec   
> > 16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec  
> > 
> > Notes:
> > - The BW and latencies of the random reader in group2 seem to be stable and
> >   bounded, and do not get impacted much as the number of sequential readers
> >   increases in group1. Hence providing good isolation.
> > 
> > - Throughput of sequential readers comes down and latencies go up as half
> >   of disk bandwidth (in terms of time) has been reserved for random reader
> >   group.
> > 
> > Test3: Sequential Reader Vs Sequential Reader
> > ============================================
> > Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> > Launched increasing number of sequential readers in group1 and one sequential
> > reader in group2 using fio and monitored how bandwidth is being distributed
> > between two groups.
> > 
> > First 5 columns give stats about job in group1 and last two columns give
> > stats about job in group2.
> > 
> > <---------------group1--------------------------->  <------group2--------->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> > 1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec   
> > 2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec   
> > 4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec   
> > 8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec   
> > 16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec   
> > 
> > Note: group2 is getting double the bandwidth of group1 even in the face
> > of increasing number of readers in group1.
> > 
> > Test4 (Isolation between two KVM virtual machines)
> > ==================================================
> > Created two KVM virtual machines. Partitioned a disk on host in two partitions
> > and gave one partition to each virtual machine. Put both the virtual machines
> > in two different cgroup of weight 1000 and 500 each. Virtual machines created
> > ext3 file system on the partitions exported from host and did buffered writes.
> > Host sees the writes as synchronous, and the virtual machine with higher
> > weight gets double the disk time of the virtual machine with lower weight.
> > Used the deadline scheduler in this test case.
> > 
> > Some more details about configuration are in documentation patch.
> > 
> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky, and the biggest reason is that async
> > writes are cached in higher layers (page cache) as well as possibly in the
> > file system layer (btrfs, xfs etc), and are dispatched to lower layers not
> > necessarily in a proportional manner.
> > 
> > For example, consider two dd threads reading /dev/zero as input file and doing
> > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > be forced to write out some pages to disk before more pages can be dirtied. But
> > not necessarily dirty pages of the same thread are picked. It can very well
> > pick the inode of the lower priority dd thread and do some writeout. So
> > effectively the higher weight dd is doing writeouts of the lower weight
> > dd's pages and we don't see service differentiation.
> > 
> > IOW, the core problem with buffered write fairness is that the higher
> > weight thread does not throw enough IO traffic at the IO controller to keep
> > the queue continuously backlogged. In my testing, there are many 0.2 to 0.8
> > second intervals where the higher weight queue is empty, and in that
> > duration the lower weight queue gets lots of work done, giving the
> > impression that there was no service differentiation.
> > 
> > In summary, from the IO controller's point of view, async writes support
> > is there. Because the page cache has not been designed so that a higher
> > prio/weight writer can do more writeout than a lower prio/weight writer,
> > getting service differentiation is hard; it is visible in some cases and
> > not in others.
> 
> Here's where it all falls to pieces.
> 
> For async writeback we just don't care about IO priorities.  Because
> from the point of view of the userspace task, the write was async!  It
> occurred at memory bandwidth speed.
> 
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation.  And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
> 
> So when balance_dirty_pages() hits, what do we want to do?
> 
> I suppose that all we can do is to block low-ioprio processes more
> agressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
> 
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.

True, that's an issue. For async writes we don't create parallel IO paths
from user space to the IO scheduler, hence it is hard to provide fairness in
all cases. I think part of the problem is the page cache, and some
serialization also comes from kjournald.

How about coming up with another cgroup controller for buffered writes, or
clubbing it with the memory controller as KAMEZAWA Hiroyuki suggested, and
co-mounting it with the io controller? This should help control buffered
writes per cgroup.

> 
> Importantly screwed!  It's a very common workload pattern, and one
> which causes tremendous amounts of IO to be generated very quickly,
> traditionally causing bad latency effects all over the place.  And we
> have no answer to this.
> 
> > Vanilla CFQ Vs IO Controller CFQ
> > ================================
> > We have not fundamentally changed CFQ, instead enhanced it to also support
> > hierarchical io scheduling. In the process invariably there are small changes
> > here and there as new scenarios come up. Running some tests here and comparing
> > both the CFQ's to see if there is any major deviation in behavior.
> > 
> > Test1: Sequential Readers
> > =========================
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
> > 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
> > 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
> > 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
> > 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  
> > 
> > IO scheduler: IO controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
> > 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
> > 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
> > 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
> > 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  
> > 
> > Test2: Sequential Writers
> > =========================
> > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
> > 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
> > 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
> > 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
> > 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
> > 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
> > 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
> > 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
> > 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  
> > 
> > Test3: Random Readers
> > =========================
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
> > 2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
> > 4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
> > 8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
> > 16  38KiB/s     8KiB/s      328KiB/s    3965 msec   
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
> > 2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
> > 4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
> > 8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
> > 16  43KiB/s     9KiB/s      327KiB/s    3905 msec   
> > 
> > Test4: Random Writers
> > =====================
> > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
> > 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
> > 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
> > 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
> > 16  66KiB/s     22KiB/s     829KiB/s    1308 msec   
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
> > 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
> > 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
> > 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
> > 16  71KiB/s     29KiB/s     814KiB/s    1457 msec   
> > 
> > Notes:
> >  - Does not look like anything has changed significantly.
> > 
> > Previous versions of the patches were posted here.
> > ------------------------------------------------
> > 
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > (V6) http://lkml.org/lkml/2009/7/2/369
> > (V7) http://lkml.org/lkml/2009/7/24/253
> > (V8) http://lkml.org/lkml/2009/8/16/204
> > (V9) http://lkml.org/lkml/2009/8/28/327
> > 
> > Thanks
> > Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]     ` <20090925100952.55c2dd7a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  2009-09-25  1:18       ` KAMEZAWA Hiroyuki
@ 2009-09-25  4:14       ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25  4:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Sep 25, 2009 at 10:09:52AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 24 Sep 2009 14:33:15 -0700
> Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > ===================================================================
> > > Fairness for async writes is tricky, and the biggest reason is that async
> > > writes are cached in higher layers (page cache) as well as possibly in
> > > the file system layer (btrfs, xfs etc), and are dispatched to lower
> > > layers not necessarily in a proportional manner.
> > > 
> > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > not necessarily dirty pages of the same thread are picked. It can very
> > > well pick the inode of the lower priority dd thread and do some writeout.
> > > So effectively the higher weight dd is doing writeouts of the lower
> > > weight dd's pages and we don't see service differentiation.
> > > 
> > > IOW, the core problem with buffered write fairness is that the higher
> > > weight thread does not throw enough IO traffic at the IO controller to
> > > keep the queue continuously backlogged. In my testing, there are many 0.2
> > > to 0.8 second intervals where the higher weight queue is empty, and in
> > > that duration the lower weight queue gets lots of work done, giving the
> > > impression that there was no service differentiation.
> > > 
> > > In summary, from the IO controller's point of view, async writes support
> > > is there. Because the page cache has not been designed so that a higher
> > > prio/weight writer can do more writeout than a lower prio/weight writer,
> > > getting service differentiation is hard; it is visible in some cases and
> > > not in others.
> > 
> > Here's where it all falls to pieces.
> > 
> > For async writeback we just don't care about IO priorities.  Because
> > from the point of view of the userspace task, the write was async!  It
> > occurred at memory bandwidth speed.
> > 
> > It's only when the kernel's dirty memory thresholds start to get
> > exceeded that we start to care about prioritisation.  And at that time,
> > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > consumes just as much memory as a low-ioprio dirty page.
> > 
> > So when balance_dirty_pages() hits, what do we want to do?
> > 
> > I suppose that all we can do is to block low-ioprio processes more
> > agressively at the VFS layer, to reduce the rate at which they're
> > dirtying memory so as to give high-ioprio processes more of the disk
> > bandwidth.
> > 
> > But you've gone and implemented all of this stuff at the io-controller
> > level and not at the VFS level so you're, umm, screwed.
> > 
> 
> I think I must support dirty-ratio in the memcg layer, but not yet.
> I can't easily imagine how the system will work if both dirty-ratio and
> io-controller cgroups are supported.

IIUC, you are suggesting a per memory cgroup dirty ratio, and the writer will
be throttled if the dirty ratio is crossed. Makes sense to me. Just that the
io controller and memory controller will have to be mounted together.

Thanks
Vivek

> But considering using them as a set of
> cgroups, called containers (zones?), it will not be bad, I think.
> 
> The final bottleneck queue for fairness in a usual workload on a usual
> (small) server will be ext3's journal, I wonder ;)
> 
> Thanks,
> -Kame
> 
> 
> > Importantly screwed!  It's a very common workload pattern, and one
> > which causes tremendous amounts of IO to be generated very quickly,
> > traditionally causing bad latency effects all over the place.  And we
> > have no answer to this.
> > 
> > > Vanilla CFQ Vs IO Controller CFQ
> > > ================================
> > > We have not fundamentally changed CFQ, instead enhanced it to also support
> > > hierarchical io scheduling. In the process invariably there are small changes
> > > here and there as new scenarios come up. Running some tests here and comparing
> > > both the CFQ's to see if there is any major deviation in behavior.
> > > 
> > > Test1: Sequential Readers
> > > =========================
> > > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > > 
> > > IO scheduler: Vanilla CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
> > > 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
> > > 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
> > > 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
> > > 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  
> > > 
> > > IO scheduler: IO controller CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
> > > 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
> > > 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
> > > 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
> > > 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  
> > > 
> > > Test2: Sequential Writers
> > > =========================
> > > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > > 
> > > IO scheduler: Vanilla CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
> > > 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
> > > 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
> > > 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
> > > 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  
> > > 
> > > IO scheduler: IO Controller CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
> > > 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
> > > 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
> > > 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
> > > 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  
> > > 
> > > Test3: Random Readers
> > > =========================
> > > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > > 
> > > IO scheduler: Vanilla CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
> > > 2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
> > > 4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
> > > 8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
> > > 16  38KiB/s     8KiB/s      328KiB/s    3965 msec   
> > > 
> > > IO scheduler: IO Controller CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
> > > 2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
> > > 4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
> > > 8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
> > > 16  43KiB/s     9KiB/s      327KiB/s    3905 msec   
> > > 
> > > Test4: Random Writers
> > > =====================
> > > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > > 
> > > IO scheduler: Vanilla CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
> > > 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
> > > 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
> > > 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
> > > 16  66KiB/s     22KiB/s     829KiB/s    1308 msec   
> > > 
> > > IO scheduler: IO Controller CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
> > > 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
> > > 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
> > > 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
> > > 16  71KiB/s     29KiB/s     814KiB/s    1457 msec   
> > > 
> > > Notes:
> > >  - It does not look like anything has changed significantly.
> > > 
> > > Previous versions of the patches were posted here.
> > > ------------------------------------------------
> > > 
> > > (V1) http://lkml.org/lkml/2009/3/11/486
> > > (V2) http://lkml.org/lkml/2009/5/5/275
> > > (V3) http://lkml.org/lkml/2009/5/26/472
> > > (V4) http://lkml.org/lkml/2009/6/8/580
> > > (V5) http://lkml.org/lkml/2009/6/19/279
> > > (V6) http://lkml.org/lkml/2009/7/2/369
> > > (V7) http://lkml.org/lkml/2009/7/24/253
> > > (V8) http://lkml.org/lkml/2009/8/16/204
> > > (V9) http://lkml.org/lkml/2009/8/28/327
> > > 
> > > Thanks
> > > Vivek
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> > 

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-25  1:09   ` KAMEZAWA Hiroyuki
@ 2009-09-25  4:14       ` Vivek Goyal
  2009-09-25  1:18       ` KAMEZAWA Hiroyuki
  2009-09-25  4:14       ` Vivek Goyal
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25  4:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, linux-kernel, jens.axboe, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo,
	riel

On Fri, Sep 25, 2009 at 10:09:52AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 24 Sep 2009 14:33:15 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > ===================================================================
> > > Fairness for async writes is tricky and biggest reason is that async writes
> > > are cached in higher layers (page cache) as well as possibly in file system
> > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > in proportional manner.
> > > 
> > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > service differentiation.
> > > 
> > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > does not throw enough IO traffic at the IO controller to keep the queue
> > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > intervals where higher weight queue is empty and in that duration lower weight
> > > queue gets a lot of work done, giving the impression that there was no service
> > > differentiation.
> > > 
> > > In summary, from IO controller point of view async writes support is there.
> > > Because page cache has not been designed in such a manner that higher 
> > > prio/weight writer can do more write out as compared to lower prio/weight
> > > writer, getting service differentiation is hard and it is visible in some
> > > cases and not visible in some cases.
> > 
> > Here's where it all falls to pieces.
> > 
> > For async writeback we just don't care about IO priorities.  Because
> > from the point of view of the userspace task, the write was async!  It
> > occurred at memory bandwidth speed.
> > 
> > It's only when the kernel's dirty memory thresholds start to get
> > exceeded that we start to care about prioritisation.  And at that time,
> > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > consumes just as much memory as a low-ioprio dirty page.
> > 
> > So when balance_dirty_pages() hits, what do we want to do?
> > 
> > I suppose that all we can do is to block low-ioprio processes more
> > aggressively at the VFS layer, to reduce the rate at which they're
> > dirtying memory so as to give high-ioprio processes more of the disk
> > bandwidth.
> > 
> > But you've gone and implemented all of this stuff at the io-controller
> > level and not at the VFS level so you're, umm, screwed.
> > 
> 
> I think I must support dirty-ratio in memcg layer. But not yet.
> I can't easily imagine how the system will work if both dirty-ratio and
> io-controller cgroup are supported.

IIUC, you are suggesting a per-memory-cgroup dirty ratio, where the writer will be
throttled if the dirty ratio is crossed.  Makes sense to me. Just that the io
controller and the memory controller shall have to be mounted together.

Thanks
Vivek

> But considering using them as a set of
> cgroups, called containers (zones?), it will not be bad, I think.
> 
> The final bottleneck queue for fairness in a usual workload on a usual (small)
> server will be ext3's journal, I wonder ;)
> 
> Thanks,
> -Kame
> 
> 
> > Importantly screwed!  It's a very common workload pattern, and one
> > which causes tremendous amounts of IO to be generated very quickly,
> > traditionally causing bad latency effects all over the place.  And we
> > have no answer to this.
> > 
> > > Vanilla CFQ Vs IO Controller CFQ
> > > ================================
> > > We have not fundamentally changed CFQ, instead enhanced it to also support
> > > hierarchical io scheduling. In the process invariably there are small changes
> > > here and there as new scenarios come up. Running some tests here and comparing
> > > both the CFQ's to see if there is any major deviation in behavior.
> > > 
> > > Test1: Sequential Readers
> > > =========================
> > > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > > 
> > > IO scheduler: Vanilla CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
> > > 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
> > > 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
> > > 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
> > > 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  
> > > 
> > > IO scheduler: IO controller CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
> > > 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
> > > 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
> > > 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
> > > 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  
> > > 
> > > Test2: Sequential Writers
> > > =========================
> > > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > > 
> > > IO scheduler: Vanilla CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
> > > 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
> > > 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
> > > 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
> > > 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  
> > > 
> > > IO scheduler: IO Controller CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
> > > 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
> > > 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
> > > 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
> > > 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  
> > > 
> > > Test3: Random Readers
> > > =========================
> > > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > > 
> > > IO scheduler: Vanilla CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
> > > 2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
> > > 4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
> > > 8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
> > > 16  38KiB/s     8KiB/s      328KiB/s    3965 msec   
> > > 
> > > IO scheduler: IO Controller CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
> > > 2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
> > > 4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
> > > 8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
> > > 16  43KiB/s     9KiB/s      327KiB/s    3905 msec   
> > > 
> > > Test4: Random Writers
> > > =====================
> > > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > > 
> > > IO scheduler: Vanilla CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
> > > 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
> > > 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
> > > 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
> > > 16  66KiB/s     22KiB/s     829KiB/s    1308 msec   
> > > 
> > > IO scheduler: IO Controller CFQ
> > > 
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > > 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
> > > 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
> > > 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
> > > 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
> > > 16  71KiB/s     29KiB/s     814KiB/s    1457 msec   
> > > 
> > > Notes:
> > >  - It does not look like anything has changed significantly.
> > > 
> > > Previous versions of the patches were posted here.
> > > ------------------------------------------------
> > > 
> > > (V1) http://lkml.org/lkml/2009/3/11/486
> > > (V2) http://lkml.org/lkml/2009/5/5/275
> > > (V3) http://lkml.org/lkml/2009/5/26/472
> > > (V4) http://lkml.org/lkml/2009/6/8/580
> > > (V5) http://lkml.org/lkml/2009/6/19/279
> > > (V6) http://lkml.org/lkml/2009/7/2/369
> > > (V7) http://lkml.org/lkml/2009/7/24/253
> > > (V8) http://lkml.org/lkml/2009/8/16/204
> > > (V9) http://lkml.org/lkml/2009/8/28/327
> > > 
> > > Thanks
> > > Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found] ` <1253820332-10246-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-24 21:33   ` Andrew Morton
@ 2009-09-25  2:20   ` Ulrich Lukas
  2009-09-29  0:37   ` Nauman Rafique
  2 siblings, 0 replies; 349+ messages in thread
From: Ulrich Lukas @ 2009-09-25  2:20 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA

Vivek Goyal wrote:
> Notes:
> - With vanilla CFQ, random writers can overwhelm a random reader.
>   Bring down its throughput and bump up latencies significantly.


IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
too.

I'm basing this assumption on the observations I made on both OpenSuse
11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
titled: "Poor desktop responsiveness with background I/O-operations" of
2009-09-20.
(Message ID: 4AB59CBB.8090907-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org)


Thus, I'm posting this to show that your work is greatly appreciated,
given the rather disappointing status quo of Linux's fairness when it
comes to disk IO time.

I hope that your efforts lead to a change in performance of current
userland applications, the sooner, the better.


Thanks
Ulrich

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]     ` <20090925100952.55c2dd7a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2009-09-25  1:18       ` KAMEZAWA Hiroyuki
  2009-09-25  4:14       ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  1:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, 25 Sep 2009 10:09:52 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:

> On Thu, 24 Sep 2009 14:33:15 -0700
> Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > ===================================================================
> > > Fairness for async writes is tricky and biggest reason is that async writes
> > > are cached in higher layers (page cache) as well as possibly in file system
> > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > in proportional manner.
> > > 
> > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > service differentiation.
> > > 
> > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > does not throw enough IO traffic at the IO controller to keep the queue
> > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > intervals where higher weight queue is empty and in that duration lower weight
> > > queue gets a lot of work done, giving the impression that there was no service
> > > differentiation.
> > > 
> > > In summary, from IO controller point of view async writes support is there.
> > > Because page cache has not been designed in such a manner that higher 
> > > prio/weight writer can do more write out as compared to lower prio/weight
> > > writer, getting service differentiation is hard and it is visible in some
> > > cases and not visible in some cases.
> > 
> > Here's where it all falls to pieces.
> > 
> > For async writeback we just don't care about IO priorities.  Because
> > from the point of view of the userspace task, the write was async!  It
> > occurred at memory bandwidth speed.
> > 
> > It's only when the kernel's dirty memory thresholds start to get
> > exceeded that we start to care about prioritisation.  And at that time,
> > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > consumes just as much memory as a low-ioprio dirty page.
> > 
> > So when balance_dirty_pages() hits, what do we want to do?
> > 
> > I suppose that all we can do is to block low-ioprio processes more
> > aggressively at the VFS layer, to reduce the rate at which they're
> > dirtying memory so as to give high-ioprio processes more of the disk
> > bandwidth.
> > 
> > But you've gone and implemented all of this stuff at the io-controller
> > level and not at the VFS level so you're, umm, screwed.
> > 
> 
> I think I must support dirty-ratio in memcg layer. But not yet.

OR...I'll add a buffered-write cgroup to track buffered writebacks,
and add a control knob such as
  bufferred_write.nr_dirty_thresh
to limit the number of dirty pages generated via a cgroup.

Because memcg just records the owner of pages but does not record who made
them dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's
blockio cgroup code.

But I'm not sure how I should treat I/Os generated by kswapd.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-25  1:09   ` KAMEZAWA Hiroyuki
@ 2009-09-25  1:18       ` KAMEZAWA Hiroyuki
  2009-09-25  1:18       ` KAMEZAWA Hiroyuki
  2009-09-25  4:14       ` Vivek Goyal
  2 siblings, 0 replies; 349+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  1:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Vivek Goyal, linux-kernel, jens.axboe, containers,
	dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	ryov, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval,
	balbir, righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds,
	mingo, riel

On Fri, 25 Sep 2009 10:09:52 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Thu, 24 Sep 2009 14:33:15 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > ===================================================================
> > > Fairness for async writes is tricky, and the biggest reason is that async
> > > writes are cached in higher layers (page cache), and possibly in the file
> > > system layer as well (btrfs, xfs, etc.), and are not necessarily dispatched
> > > to lower layers in a proportional manner.
> > > 
> > > For example, consider two dd threads reading /dev/zero as input and writing
> > > huge files. Very soon we will cross vm_dirty_ratio, and a dd thread will be
> > > forced to write out some pages to disk before more pages can be dirtied. But
> > > the dirty pages picked are not necessarily those of the same thread; it can
> > > very well pick the inode of the lower priority dd thread and do some
> > > writeout. So effectively the higher weight dd is doing writeouts of the
> > > lower weight dd's pages, and we don't see service differentiation.
> > > 
> > > IOW, the core problem with buffered write fairness is that the higher
> > > weight thread does not throw enough IO traffic at the IO controller to keep
> > > the queue continuously backlogged. In my testing, there are many 0.2 to 0.8
> > > second intervals where the higher weight queue is empty, and in that
> > > duration the lower weight queue gets a lot of work done, giving the
> > > impression that there was no service differentiation.
> > > 
> > > In summary, from the IO controller's point of view, async write support is
> > > there. Because the page cache has not been designed so that a higher
> > > prio/weight writer can do more writeout than a lower prio/weight writer,
> > > getting service differentiation is hard; it is visible in some cases and
> > > not in others.
> > 
> > Here's where it all falls to pieces.
> > 
> > For async writeback we just don't care about IO priorities.  Because
> > from the point of view of the userspace task, the write was async!  It
> > occurred at memory bandwidth speed.
> > 
> > It's only when the kernel's dirty memory thresholds start to get
> > exceeded that we start to care about prioritisation.  And at that time,
> > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > consumes just as much memory as a low-ioprio dirty page.
> > 
> > So when balance_dirty_pages() hits, what do we want to do?
> > 
> > I suppose that all we can do is to block low-ioprio processes more
> > aggressively at the VFS layer, to reduce the rate at which they're
> > dirtying memory so as to give high-ioprio processes more of the disk
> > bandwidth.
> > 
> > But you've gone and implemented all of this stuff at the io-controller
> > level and not at the VFS level so you're, umm, screwed.
> > 
> 
> I think I must support dirty-ratio in memcg layer. But not yet.

OR...I'll add a buffered-write cgroup to track buffered writebacks,
and add a control knob such as
  buffered_write.nr_dirty_thresh
to limit the number of dirty pages generated via a cgroup.

Because memcg only records the owner of pages, not who dirties them,
this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
cgroup code.

But I'm not sure how I should treat I/Os generated by kswapd.
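Such a knob can be pictured with a toy userspace model. Everything below (the class, the method names, the `buffered_write.nr_dirty_thresh` semantics) is a hypothetical illustration of the per-cgroup dirty-page accounting being proposed, not kernel code:

```python
# Toy model of a per-cgroup dirty-page threshold knob. A writer must
# "charge" each page it dirties to its cgroup; once the group hits its
# threshold, further dirtying is refused until writeback cleans pages.
class BufferedWriteCgroup:
    def __init__(self, nr_dirty_thresh):
        self.nr_dirty_thresh = nr_dirty_thresh  # max dirty pages allowed
        self.nr_dirty = 0                       # dirty pages charged here

    def try_dirty_page(self):
        """Charge one page; refuse (caller would be throttled into
        writeback) once the group's threshold is reached."""
        if self.nr_dirty >= self.nr_dirty_thresh:
            return False
        self.nr_dirty += 1
        return True

    def page_cleaned(self):
        """Writeback completed for one page; uncharge it."""
        self.nr_dirty -= 1

grp = BufferedWriteCgroup(nr_dirty_thresh=2)
assert grp.try_dirty_page() and grp.try_dirty_page()
assert not grp.try_dirty_page()   # over threshold: would be throttled
grp.page_cleaned()
assert grp.try_dirty_page()       # writeback made room again
```

The point of the sketch is only that throttling happens at dirtying time, per cgroup, rather than at IO dispatch time.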

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found]   ` <20090924143315.781cd0ac.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2009-09-25  1:09     ` KAMEZAWA Hiroyuki
  2009-09-25  5:04     ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25  1:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, 24 Sep 2009 14:33:15 -0700
Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky, and the biggest reason is that async
> > writes are cached in higher layers (page cache), and possibly in the file
> > system layer as well (btrfs, xfs, etc.), and are not necessarily dispatched
> > to lower layers in a proportional manner.
> > 
> > For example, consider two dd threads reading /dev/zero as input and writing
> > huge files. Very soon we will cross vm_dirty_ratio, and a dd thread will be
> > forced to write out some pages to disk before more pages can be dirtied. But
> > the dirty pages picked are not necessarily those of the same thread; it can
> > very well pick the inode of the lower priority dd thread and do some
> > writeout. So effectively the higher weight dd is doing writeouts of the
> > lower weight dd's pages, and we don't see service differentiation.
> > 
> > IOW, the core problem with buffered write fairness is that the higher weight
> > thread does not throw enough IO traffic at the IO controller to keep the
> > queue continuously backlogged. In my testing, there are many 0.2 to 0.8
> > second intervals where the higher weight queue is empty, and in that
> > duration the lower weight queue gets a lot of work done, giving the
> > impression that there was no service differentiation.
> > 
> > In summary, from the IO controller's point of view, async write support is
> > there. Because the page cache has not been designed so that a higher
> > prio/weight writer can do more writeout than a lower prio/weight writer,
> > getting service differentiation is hard; it is visible in some cases and
> > not in others.
> 
> Here's where it all falls to pieces.
> 
> For async writeback we just don't care about IO priorities.  Because
> from the point of view of the userspace task, the write was async!  It
> occurred at memory bandwidth speed.
> 
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation.  And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
> 
> So when balance_dirty_pages() hits, what do we want to do?
> 
> I suppose that all we can do is to block low-ioprio processes more
> aggressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
> 
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.
> 

I think I must support dirty-ratio at the memcg layer. But not yet.
I can't easily imagine how the system will work if both the dirty-ratio and
io-controller cgroups are supported. But considering using them as a set of
cgroups, called containers (zones?), it will not be bad, I think.

The final bottleneck queue for fairness in the usual workload on a usual
(small) server will be ext3's journal, I wonder ;)

Thanks,
-Kame


> Importantly screwed!  It's a very common workload pattern, and one
> which causes tremendous amounts of IO to be generated very quickly,
> traditionally causing bad latency effects all over the place.  And we
> have no answer to this.
> 
> > Vanilla CFQ Vs IO Controller CFQ
> > ================================
> > We have not fundamentally changed CFQ; instead we enhanced it to also support
> > hierarchical IO scheduling. In the process there are invariably small changes
> > here and there as new scenarios come up. We ran some tests comparing both
> > CFQs to see if there is any major deviation in behavior.
> > 
> > Test1: Sequential Readers
> > =========================
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
> > 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
> > 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
> > 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
> > 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  
> > 
> > IO scheduler: IO controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
> > 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
> > 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
> > 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
> > 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  
> > 
> > Test2: Sequential Writers
> > =========================
> > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
> > 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
> > 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
> > 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
> > 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
> > 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
> > 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
> > 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
> > 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  
> > 
> > Test3: Random Readers
> > =========================
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
> > 2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
> > 4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
> > 8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
> > 16  38KiB/s     8KiB/s      328KiB/s    3965 msec   
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
> > 2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
> > 4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
> > 8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
> > 16  43KiB/s     9KiB/s      327KiB/s    3905 msec   
> > 
> > Test4: Random Writers
> > =====================
> > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > 
> > IO scheduler: Vanilla CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
> > 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
> > 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
> > 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
> > 16  66KiB/s     22KiB/s     829KiB/s    1308 msec   
> > 
> > IO scheduler: IO Controller CFQ
> > 
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> > 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
> > 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
> > 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
> > 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
> > 16  71KiB/s     29KiB/s     814KiB/s    1457 msec   
> > 
> > Notes:
> >  - It does not look like anything has changed significantly.
> > 
> > Previous versions of the patches were posted here.
> > ------------------------------------------------
> > 
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > (V6) http://lkml.org/lkml/2009/7/2/369
> > (V7) http://lkml.org/lkml/2009/7/24/253
> > (V8) http://lkml.org/lkml/2009/8/16/204
> > (V9) http://lkml.org/lkml/2009/8/28/327
> > 
> > Thanks
> > Vivek
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
       [not found] ` <1253820332-10246-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-24 21:33   ` Andrew Morton
  2009-09-25  2:20   ` Ulrich Lukas
  2009-09-29  0:37   ` Nauman Rafique
  2 siblings, 0 replies; 349+ messages in thread
From: Andrew Morton @ 2009-09-24 21:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, 24 Sep 2009 15:25:04 -0400
Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> 
> Hi All,
> 
> Here is the V10 of the IO controller patches generated on top of 2.6.31.
> 

Thanks for the writeup.  It really helps and is most worthwhile for a
project of this importance, size and complexity.


>  
> What problem are we trying to solve
> ===================================
> Provide a group IO scheduling feature in Linux along the lines of other
> resource controllers like cpu.
> 
> IOW, provide a facility so that a user can group applications using cgroups
> and control the amount of disk time/bandwidth received by a group based on
> its weight.
> 
> How to solve the problem
> =========================
> 
> Different people have solved the issue differently. So far it looks like we
> have the following two core requirements when it comes to fairness at the
> group level.
> 
> - Control bandwidth seen by groups.
> - Control on latencies when a request gets backlogged in group.
> 
> At least there are now three patchsets available (including this one).
> 
> IO throttling
> -------------
> This is a bandwidth controller which keeps track of the IO rate of a group
> and throttles the processes in the group if it exceeds the user-specified
> limit.
> 
> dm-ioband
> ---------
> This is a proportional bandwidth controller implemented as a device mapper
> driver; it provides fair access in terms of the amount of IO done (not in
> terms of disk time, as CFQ does).
> 
> So one will set up one or more dm-ioband devices on top of a physical/logical
> block device, configure the ioband device, and pass in information like
> grouping etc. This device will then keep track of bios flowing through it and
> control the flow of bios based on group policies.
> 
> IO scheduler based IO controller
> --------------------------------
> Here we have viewed the problem of the IO controller as a hierarchical group
> scheduling issue (along the lines of CFS group scheduling). Currently one can
> view Linux IO schedulers as flat, where there is one root group and all the
> IO belongs to that group.
> 
> This patchset basically modifies IO schedulers to also support hierarchical
> group scheduling. CFQ already provides fairness among different processes; I
> have extended it to support group IO scheduling. I also took some of the code
> out of CFQ and put it in a common layer, so that the same group scheduling
> code can be used by noop, deadline and AS.
> 
> Pros/Cons
> =========
> There are pros and cons to each of the approaches. Following are some
> thoughts.
> 
> Max bandwidth vs proportional bandwidth
> ---------------------------------------
> IO throttling is a max bandwidth controller, not a proportional one.
> Additionally, it provides fairness in terms of the amount of IO done (and
> not in terms of disk time, as CFQ does).
> 
> Personally, I think that a proportional weight controller is useful to more
> people than just a max bandwidth controller. In addition, the IO scheduler
> based controller can also be enhanced to do max bandwidth control, so it can
> satisfy a wider set of requirements.
> 
> Fairness in terms of disk time vs size of IO
> ---------------------------------------------
> A higher level controller will most likely be limited to providing fairness
> in terms of the size/number of IOs done, and will find it hard to provide
> fairness in terms of disk time used (as CFQ provides between various prio
> levels). This is because only the IO scheduler knows how much disk time a
> queue has used, and information about queues and disk time used is not
> exported to higher layers.
> 
> So a seeky application will still run away with a lot of disk time and bring
> down the overall throughput of the disk.

But that's only true if the thing is poorly implemented.

A high-level controller will need some view of the busyness of the
underlying device(s).  That could be "proportion of idle time", or
"average length of queue" or "average request latency" or some mix of
these or something else altogether.

But these things are simple to calculate, and are simple to feed back
to the higher-level controller, and probably don't require any changes
to the IO scheduler at all, which is a great advantage.


And I must say that high-level throttling based upon feedback from
lower layers seems like a much better model to me than hacking away in
the IO scheduler layer.  Both from an implementation point of view and
from a "we can get it to work on things other than block devices" point
of view.
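(Purely as illustration of the feedback model described here: a higher-level
controller could sample a cheap busyness metric from the device and only
throttle a group when the device is actually congested. All names, the
metric choice, and the thresholds below are invented for the sketch; nothing
here is an existing kernel interface.)

```c
#include <assert.h>

/* smoothed per-device request latency, maintained at IO completion */
struct dev_feedback {
	unsigned long ewma_lat_us;
};

/* ewma = (7*ewma + sample) / 8 -- cheap to maintain per completion */
static void dev_feedback_update(struct dev_feedback *fb, unsigned long lat_us)
{
	fb->ewma_lat_us = (fb->ewma_lat_us * 7 + lat_us) / 8;
}

/*
 * Throttle a group only if the device looks congested AND the group is
 * over its configured share of the observed bandwidth.
 */
static int should_throttle(const struct dev_feedback *fb,
			   unsigned long congested_lat_us,
			   unsigned long group_bw, unsigned long group_limit)
{
	if (fb->ewma_lat_us < congested_lat_us)
		return 0;	/* device idle enough: don't throttle anyone */
	return group_bw > group_limit;
}
```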

> Currently dm-ioband provides fairness in terms of number/size of IO.
> 
> Latencies and isolation between groups
> --------------------------------------
> A higher-level controller generally implements a bandwidth throttling
> solution: if a group exceeds either the max bandwidth or its proportional
> share, that group is throttled.
> 
> This kind of approach will probably not help in controlling latencies, as it
> will depend on the underlying IO scheduler. Consider the following scenario.
> 
> Assume there are two groups. One group is running multiple sequential readers
> and the other group has a random reader. Sequential readers will get a nice
> 100ms slice

Do you refer to each reader within group1, or to all readers?  It would be
daft if each reader in group1 were to get 100ms.

> each, and then the random reader from group2 will get to dispatch its
> request. So the latency of this random reader will depend on how many
> sequential readers are running in the other group, and that is weak isolation
> between groups.

And yet that is what you appear to mean.

But surely nobody would do that - the 100ms would be assigned to and
distributed amongst all readers in group1?

> When we control things at the IO scheduler level, we assign one time slice to
> one group and then pick the next entity to run. So effectively after one time
> slice (max 180ms, if a prio 0 sequential reader is running), the random reader
> in the other group will get to run. Hence we achieve better isolation between
> groups, as the response time of a process in a different group is generally
> not dependent on the number of processes running in the competing group.

I don't understand why you're comparing this implementation with such
an obviously dumb competing design!
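(A toy sketch of the per-group slicing argument in the quoted text. The 100ms
base slice comes from the text; the linear weight scaling and BASE_WEIGHT are
simplifications invented for illustration. The point is only that a backlogged
group waits at most one slice per competing *group*, regardless of how many
tasks each of those groups runs.)

```c
#include <assert.h>

#define BASE_SLICE_MS 100	/* base slice from the text */
#define BASE_WEIGHT   500	/* assumed reference weight */

/* one slice per group, scaled by the group's weight */
static unsigned int group_slice_ms(unsigned int weight)
{
	return BASE_SLICE_MS * weight / BASE_WEIGHT;
}

/* worst-case wait before group 'me' runs again: sum of the other slices */
static unsigned int worst_wait_ms(const unsigned int *weights, int n, int me)
{
	unsigned int wait = 0;

	for (int i = 0; i < n; i++)
		if (i != me)
			wait += group_slice_ms(weights[i]);
	return wait;
}
```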

> So a higher-level solution is most likely limited to shaping bandwidth
> without any control over latencies.
> 
> Stacking group scheduler on top of CFQ can lead to issues
> ---------------------------------------------------------
> IO throttling and dm-ioband are both second-level controllers. That is, these
> controllers are implemented in higher layers than the IO schedulers. So they
> control the IO at a higher layer based on group policies, and later the IO
> schedulers take care of dispatching these bios to disk.
> 
> Implementing a second-level controller has the advantage of being able to
> provide bandwidth control even on logical block devices in the IO stack
> which don't have any IO schedulers attached to them. But it can also
> interfere with the IO scheduling policy of the underlying IO scheduler and
> change the effective behavior. Following are some of the issues which I think
> will be visible in a second-level controller in one form or another.
> 
>   Prio with-in group
>   ------------------
>   A second-level controller can potentially interfere with the behavior of
>   different prio processes within a group. bios are buffered at a higher layer
>   in a single queue, and the release of bios is FIFO and not proportionate to
>   the ioprio of the process. This can result in a particular prio level not
>   getting its fair share.

That's an administrator error, isn't it?  Should have put the
different-priority processes into different groups.

>   Buffering at a higher layer can delay read requests for more than the slice
>   idle period of CFQ (default 8 ms). That means it is possible that we are
>   waiting for a request from the queue but it is buffered at a higher layer,
>   and then the idle timer will fire. It means that the queue will lose its
>   share, and at the same time overall throughput will be impacted as we lost
>   those 8 ms.

That sounds like a bug.
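(A toy model of the timing hazard in the quoted paragraph; illustrative only,
not CFQ code. CFQ waits slice_idle, 8 ms by default, for the next request from
the active queue. If a second-level controller holds the bio upstream for
longer, the timer fires first: the queue loses its turn AND the disk sat idle
for the whole window.)

```c
#include <assert.h>

#define SLICE_IDLE_MS 8		/* CFQ's default slice_idle */

/* 1 if the queue keeps its slice, 0 if the idle timer fired first */
static int keeps_slice(unsigned int next_request_delay_ms)
{
	return next_request_delay_ms <= SLICE_IDLE_MS;
}

/* disk time wasted idling before either the request arrives or the timer fires */
static unsigned int idle_wasted_ms(unsigned int next_request_delay_ms)
{
	return next_request_delay_ms <= SLICE_IDLE_MS ?
		next_request_delay_ms : SLICE_IDLE_MS;
}
```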

>   Read Vs Write
>   -------------
>   Writes can overwhelm readers, hence a second-level controller's FIFO
>   release will run into issues here. If a single queue is maintained then
>   reads will suffer large latencies. If there are separate queues for reads
>   and writes then it will be hard to decide in what ratio to dispatch reads
>   and writes, as it is the IO scheduler's decision when and how much
>   read/write to dispatch. This is another place where the higher-level
>   controller will not be in sync with the lower-level IO scheduler and can
>   change the effective policies of the underlying IO scheduler.

The IO schedulers already take care of read-vs-write and already take
care of preventing large writes-starve-reads latencies (or at least,
they're supposed to).

>   CFQ IO context Issues
>   ---------------------
>   Buffering at a higher layer means bios are submitted later with the help of
>   a worker thread.

Why?

If it's a read, we just block the userspace process.

If it's a delayed write, the IO submission already happens in a kernel thread.

If it's a synchronous write, we have to block the userspace caller
anyway.

Async reads might be an issue, dunno.

> This changes the io context information at the CFQ layer, which
>   assigns the request to the submitting thread. The change of io context info
>   again leads to issues of idle timer expiry, a process not getting its fair
>   share, and reduced throughput.

But we already have that problem with delayed writeback, which is a
huge thing - often it's the majority of IO.

>   Throughput with noop, deadline and AS
>   ---------------------------------------------
>   I think a higher-level controller will result in reduced overall throughput
>   (as compared to an IO scheduler based IO controller) and more seeks with
>   noop, deadline and AS.
> 
>   The reason being that IO within a group will likely be related and
>   relatively close together, as compared to IO across groups. For example,
>   the thread pool of kvm-qemu doing IO for a virtual machine. In the case of
>   higher-level control, IO from various groups will go into a single queue at
>   the lower-level controller, and it might happen that IO is now interleaved
>   (G1, G2, G1, G3, G4....), causing more seeks and reduced throughput. (Agreed
>   that merging will help up to some extent, but still....)
> 
>   Instead, in the case of a lower-level controller, the IO scheduler maintains
>   one queue per group, hence there is no interleaving of IO between groups.
>   And if IO is related within a group, then we should get a reduced
>   number/amount of seeks and higher throughput.
> 
>   Latency can be a concern but that can be controlled by reducing the time
>   slice length of the queue.

Well maybe, maybe not.  If a group is throttled, it isn't submitting
new IO.  The unthrottled group is doing the IO submitting and that IO
will have decent locality.

> Fairness at logical device level vs at physical device level
> ------------------------------------------------------------
> 
> The IO scheduler based controller has the limitation that it works only with
> the bottom-most devices in the IO stack, where the IO scheduler is attached.
> 
> For example, assume a user has created a logical device lv0 using three
> underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> in two groups doing IO on lv0, and that the weights of the groups are in the
> ratio 2:1, so T1 should get double the BW of T2 on the lv0 device.
> 
> 			     T1    T2
> 			       \   /
> 			        lv0
> 			      /  |  \
> 			    sda sdb  sdc
> 
> 
> Now resource control will take place only on devices sda, sdb and sdc and
> not at the lv0 level. So if IO from the two tasks is relatively uniformly
> distributed across the disks then T1 and T2 will see the throughput ratio
> in proportion to the weights specified. But if IO from T1 and T2 is going to
> different disks and there is no contention, then at the higher level they
> both will see the same BW.
> 
> Here a second-level controller can produce better fairness numbers at the
> logical device, but most likely at reduced overall throughput of the system,
> because it will try to control IO even if there is no contention at the
> physical level, possibly leaving disks unused in the system.
> 
> Hence the question arises of how important it is to control bandwidth at
> higher-level logical devices also. The actual contention for resources is
> at the leaf block device, so it probably makes sense to do any kind of
> control there and not at the intermediate devices. Secondly, it probably
> also means better use of available resources.

hm.  What will be the effects of this limitation in real-world use?
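(A toy model of the lv0 scenario quoted above, with resource control applied
per physical disk only. DISK_BW and the units are arbitrary illustration
values. Each shared disk divides its bandwidth 2:1 between T1 and T2; lv0's
aggregate is the sum over the disks a task touches.)

```c
#include <assert.h>

#define NR_DISKS 3	/* sda, sdb, sdc */
#define DISK_BW  100	/* arbitrary bandwidth units per disk */

/* aggregate bandwidth seen at lv0 by T1 (weight 2) and T2 (weight 1) */
static void lv0_bw(int uniform, unsigned int *t1, unsigned int *t2)
{
	if (uniform) {
		/* both tasks hit all disks: the per-disk 2:1 split survives */
		*t1 = NR_DISKS * (DISK_BW * 2 / 3);
		*t2 = NR_DISKS * (DISK_BW * 1 / 3);
	} else {
		/* tasks on disjoint disks: no contention, weights are moot */
		*t1 = DISK_BW;
		*t2 = DISK_BW;
	}
}
```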

> Limited Fairness
> ----------------
> Currently CFQ idles on a sequential reader queue to make sure it gets its
> fair share. A second-level controller will find it tricky to anticipate.
> Either it will not have any anticipation logic, in which case it will not
> provide fairness to single readers in a group (as dm-ioband does), or, if it
> starts anticipating, then we could run into strange situations where the
> second-level controller is anticipating on one queue/group while the
> underlying IO scheduler is anticipating on something else.

It depends on the size of the inter-group timeslices.  If the amount of
time for which a group is unthrottled is "large" compared to the
typical anticipation times, this issue fades away.

And those timeslices _should_ be large.  Because as you mentioned
above, different groups are probably working different parts of the
disk.

> Need of device mapper tools
> ---------------------------
> A device mapper based solution will require creation of an ioband device
> on each physical/logical device one wants to control. So it requires usage
> of device mapper tools even for people who are not using device mapper.
> At the same time, creation of an ioband device on each partition in the
> system to control the IO can be cumbersome and overwhelming if the system
> has lots of disks and partitions.
> 
> 
> IMHO, an IO scheduler based IO controller is a reasonable approach to solve
> the problem of group bandwidth control, and can do hierarchical IO scheduling
> more tightly and efficiently.
> 
> But I am all ears to alternative approaches and suggestions on how things
> can be done better, and will be glad to implement them.
> 
> TODO
> ====
> - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> - More testing to make sure there are no regressions in CFQ.
> 
> Testing
> =======
> 
> Environment
> ==========
> A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.

That's a bit of a toy.

Do we have testing results for more enterprisey hardware?  Big storage
arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)


> I am mostly
> running fio jobs which have been limited to 30-second runs, and then
> monitoring the throughput and latency.
>  
> Test1: Random Reader Vs Random Writers
> ======================================
> Launched a random reader and then an increasing number of random writers to
> see the effect on the random reader's BW and max latencies.
> 
> [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> 
> [Vanilla CFQ, No groups]
> <--------------random writers-------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec   
> 2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec   
> 4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec   
> 8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec   
> 16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec   
> 32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec   
> 
> Created two cgroups, group1 and group2, with weights of 500 each.  Launched
> an increasing number of random writers in group1 and one random reader in
> group2 using fio.
> 
> [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> <--------------random writers(group1)-------------> <-random reader(group2)->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec   
> 2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec   
> 4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec   
> 8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec   
> 16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec   
> 32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec   

That's a good result.

> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
> 
> [IO controller CFQ; No groups ]
> <--------------random writers-------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec   
> 2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec   
> 4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec   
> 8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec   
> 16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec   
> 32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec   
> 
> Notes:
> - With vanilla CFQ, random writers can overwhelm a random reader, bringing
>   down its throughput and bumping up latencies significantly.

Isn't that a CFQ shortcoming which we should address separately?  If
so, the comparisons aren't presently valid because we're comparing with
a CFQ which has known, should-be-fixed problems.

> - With the IO controller, one can provide isolation to the random reader
>   group and maintain a consistent view of bandwidth and latencies.
> 
> Test2: Random Reader Vs Sequential Reader
> ========================================
> Launched a random reader and then an increasing number of sequential readers
> to see the effect on the BW and latencies of the random reader.
> 
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> 
> [ Vanilla CFQ, No groups ]
> <---------------seq readers---------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec   
> 2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec   
> 4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec   
> 8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec   
> 16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec  
> 
> Created two cgroups, group1 and group2, with weights of 500 each.  Launched
> an increasing number of sequential readers in group1 and one random reader
> in group2 using fio.
> 
> [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> <---------------group1--------------------------->  <------group2--------->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec   
> 2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec   
> 4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec   
> 8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec   
> 16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec   
> 
> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
> 
> [IO controller CFQ; No groups ]
> <---------------seq readers---------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec   
> 2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec   
> 4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec   
> 8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec   
> 16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec  
> 
> Notes:
> - The BW and latencies of the random reader in group2 seem to be stable and
>   bounded, and do not get impacted much as the number of sequential readers
>   increases in group1. Hence providing good isolation.
> 
> - Throughput of the sequential readers comes down and latencies go up, as
>   half of the disk bandwidth (in terms of time) has been reserved for the
>   random reader group.
> 
> Test3: Sequential Reader Vs Sequential Reader
> ============================================
> Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> Launched an increasing number of sequential readers in group1 and one
> sequential reader in group2 using fio, and monitored how bandwidth is
> distributed between the two groups.
> 
> First 5 columns give stats about job in group1 and last two columns give
> stats about job in group2.
> 
> <---------------group1--------------------------->  <------group2--------->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec   
> 2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec   
> 4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec   
> 8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec   
> 16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec   
> 
> Note: group2 is getting double the bandwidth of group1 even in the face
> of increasing number of readers in group1.
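(For what it's worth, the Test3 numbers line up with the arithmetic one would
expect from the weights; the helper below is just that sanity check, nothing
from the patchset itself.)

```c
#include <assert.h>

/*
 * Expected share of disk time for a group, in percent: with weights
 * 500 and 1000, group2 should get 1000/1500 = ~66%, i.e. double
 * group1's share -- and the table's aggregate bandwidths agree,
 * e.g. 19367 / (19367 + 9395) is about 67%.
 */
static unsigned int expected_pct(unsigned int weight, unsigned int total_weight)
{
	return 100 * weight / total_weight;
}
```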
> 
> Test4 (Isolation between two KVM virtual machines)
> ==================================================
> Created two KVM virtual machines. Partitioned a disk on the host into two
> partitions and gave one partition to each virtual machine. Put the two
> virtual machines in two different cgroups of weight 1000 and 500 each. The
> virtual machines created ext3 file systems on the partitions exported from
> the host and did buffered writes. The host sees the writes as synchronous,
> and the virtual machine with the higher weight gets double the disk time of
> the virtual machine with the lower weight. Used the deadline scheduler in
> this test case.
> 
> Some more details about configuration are in documentation patch.
> 
> Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> ===================================================================
> Fairness for async writes is tricky, and the biggest reason is that async
> writes are cached in higher layers (page cache) as well as possibly in the
> file system layer (btrfs, xfs etc), and are dispatched to lower layers not
> necessarily in a proportional manner.
> 
> For example, consider two dd threads reading /dev/zero as input and writing
> huge files. Very soon we will cross vm_dirty_ratio, and a dd thread will be
> forced to write out some pages to disk before more pages can be dirtied. But
> the dirty pages of that same thread are not necessarily picked. It can very
> well pick the inode of the lower-priority dd thread and do some writeout. So
> effectively the higher-weight dd is doing writeouts of the lower-weight dd's
> pages, and we don't see service differentiation.
> 
> IOW, the core problem with buffered write fairness is that the higher-weight
> thread does not throw enough IO traffic at the IO controller to keep the
> queue continuously backlogged. In my testing, there are many 0.2 to 0.8
> second intervals where the higher-weight queue is empty, and in that duration
> the lower-weight queue gets lots of work done, giving the impression that
> there was no service differentiation.
> 
> In summary, from the IO controller point of view, async write support is
> there. But because the page cache has not been designed in such a manner that
> a higher prio/weight writer can do more writeout than a lower prio/weight
> writer, getting service differentiation is hard; it is visible in some cases
> and not in others.

Here's where it all falls to pieces.

For async writeback we just don't care about IO priorities.  Because
from the point of view of the userspace task, the write was async!  It
occurred at memory bandwidth speed.

It's only when the kernel's dirty memory thresholds start to get
exceeded that we start to care about prioritisation.  And at that time,
all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
consumes just as much memory as a low-ioprio dirty page.

So when balance_dirty_pages() hits, what do we want to do?

I suppose that all we can do is to block low-ioprio processes more
aggressively at the VFS layer, to reduce the rate at which they're
dirtying memory so as to give high-ioprio processes more of the disk
bandwidth.

But you've gone and implemented all of this stuff at the io-controller
level and not at the VFS level so you're, umm, screwed.

Importantly screwed!  It's a very common workload pattern, and one
which causes tremendous amounts of IO to be generated very quickly,
traditionally causing bad latency effects all over the place.  And we
have no answer to this.
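(A sketch of the VFS-level idea suggested above: once dirty memory is over the
threshold, pause low-ioprio dirtiers longer than high-ioprio ones so the
latter keep more of the writeback bandwidth. The names and the linear penalty
are invented for illustration; this is not how balance_dirty_pages() actually
works.)

```c
#include <assert.h>

#define BASE_PAUSE_MS 10	/* assumed base throttle pause */

/*
 * ioprio 0 (highest) .. 7 (lowest); returns how long to block the
 * dirtying task before letting it dirty more pages.
 */
static unsigned int dirty_pause_ms(int over_threshold, int ioprio)
{
	if (!over_threshold)
		return 0;	/* below vm_dirty_ratio: writes stay async */
	return BASE_PAUSE_MS * (1 + ioprio);
}
```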

> Vanilla CFQ Vs IO Controller CFQ
> ================================
> We have not fundamentally changed CFQ; instead we enhanced it to also support
> hierarchical IO scheduling. In the process there are invariably small changes
> here and there as new scenarios come up. Running some tests here and
> comparing the two CFQs to see if there is any major deviation in behavior.
> 
> Test1: Sequential Readers
> =========================
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
> 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
> 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
> 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
> 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  
> 
> IO scheduler: IO controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
> 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
> 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
> 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
> 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  
> 
> Test2: Sequential Writers
> =========================
> [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
> 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
> 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
> 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
> 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  
> 
> IO scheduler: IO Controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
> 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
> 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
> 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
> 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  
> 
> Test3: Random Readers
> =========================
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
> 2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
> 4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
> 8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
> 16  38KiB/s     8KiB/s      328KiB/s    3965 msec   
> 
> IO scheduler: IO Controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
> 2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
> 4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
> 8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
> 16  43KiB/s     9KiB/s      327KiB/s    3905 msec   
> 
> Test4: Random Writers
> =====================
> [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
> 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
> 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
> 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
> 16  66KiB/s     22KiB/s     829KiB/s    1308 msec   
> 
> IO scheduler: IO Controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
> 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
> 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
> 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
> 16  71KiB/s     29KiB/s     814KiB/s    1457 msec   
> 
> Notes:
>  - Does not look like anything has changed significantly.
> 
> Previous versions of the patches were posted here.
> ------------------------------------------------
> 
> (V1) http://lkml.org/lkml/2009/3/11/486
> (V2) http://lkml.org/lkml/2009/5/5/275
> (V3) http://lkml.org/lkml/2009/5/26/472
> (V4) http://lkml.org/lkml/2009/6/8/580
> (V5) http://lkml.org/lkml/2009/6/19/279
> (V6) http://lkml.org/lkml/2009/7/2/369
> (V7) http://lkml.org/lkml/2009/7/24/253
> (V8) http://lkml.org/lkml/2009/8/16/204
> (V9) http://lkml.org/lkml/2009/8/28/327
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
  2009-09-24 19:25 Vivek Goyal
@ 2009-09-24 21:33   ` Andrew Morton
  2009-09-25  2:20 ` Ulrich Lukas
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 349+ messages in thread
From: Andrew Morton @ 2009-09-24 21:33 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida,
	taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda,
	agk, vgoyal, peterz, jmarchan, torvalds, mingo, riel

On Thu, 24 Sep 2009 15:25:04 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:

> 
> Hi All,
> 
> Here is the V10 of the IO controller patches generated on top of 2.6.31.
> 

Thanks for the writeup.  It really helps and is most worthwhile for a
project of this importance, size and complexity.


>  
> What problem are we trying to solve
> ===================================
> Provide group IO scheduling feature in Linux along the lines of other resource
> controllers like cpu.
> 
> IOW, provide facility so that a user can group applications using cgroups and
> control the amount of disk time/bandwidth received by a group based on its
> weight. 
> 
> How to solve the problem
> =========================
> 
> Different people have solved the issue differently. So far it looks like we
> have the following two core requirements when it comes to fairness at the
> group level.
> 
> - Control bandwidth seen by groups.
> - Control on latencies when a request gets backlogged in group.
> 
> At least there are now three patchsets available (including this one).
> 
> IO throttling
> -------------
> This is a bandwidth controller which keeps track of IO rate of a group and
> throttles the process in the group if it exceeds the user specified limit.
> 
> dm-ioband
> ---------
> This is a proportional bandwidth controller implemented as device mapper
> driver and provides fair access in terms of amount of IO done (not in terms
> of disk time as CFQ does).
> 
> So one will setup one or more dm-ioband devices on top of physical/logical
> block device, configure the ioband device and pass information like grouping
> etc. Now this device will keep track of bios flowing through it and control
> the flow of bios based on group policies.
> 
> IO scheduler based IO controller
> --------------------------------
> Here we have viewed the problem of the IO controller as a hierarchical group
> scheduling (along the lines of CFS group scheduling) issue. Currently one can
> view Linux IO schedulers as flat, where there is one root group and all the
> IO belongs to that group.
> 
> This patchset basically modifies IO schedulers to also support hierarchical
> group scheduling. CFQ already provides fairness among different processes. I 
> have extended it support group IO schduling. Also took some of the code out
> of CFQ and put in a common layer so that same group scheduling code can be
> used by noop, deadline and AS to support group scheduling. 
> 
> Pros/Cons
> =========
> There are pros and cons to each of the approach. Following are some of the
> thoughts.
> 
> Max bandwidth vs proportional bandwidth
> ---------------------------------------
> IO throttling is a max bandwidth controller and not a proportional one.
> Additionaly it provides fairness in terms of amount of IO done (and not in
> terms of disk time as CFQ does).
> 
> Personally, I think that proportional weight controller is useful to more
> people than just max bandwidth controller. In addition, IO scheduler based
> controller can also be enhanced to do max bandwidth control. So it can 
> satisfy wider set of requirements.
> 
> Fairness in terms of disk time vs size of IO
> ---------------------------------------------
> An higher level controller will most likely be limited to providing fairness
> in terms of size/number of IO done and will find it hard to provide fairness
> in terms of disk time used (as CFQ provides between various prio levels). This
> is because only IO scheduler knows how much disk time a queue has used and
> information about queues and disk time used is not exported to higher
> layers.
> 
> So a seeky application will still run away with lot of disk time and bring
> down the overall throughput of the the disk.

But that's only true if the thing is poorly implemented.

A high-level controller will need some view of the busyness of the
underlying device(s).  That could be "proportion of idle time", or
"average length of queue" or "average request latency" or some mix of
these or something else altogether.

But these things are simple to calculate, and are simple to feed back
to the higher-level controller and probably don't require any changes
to to IO scheduler at all, which is a great advantage.


And I must say that high-level throttling based upon feedback from
lower layers seems like a much better model to me than hacking away in
the IO scheduler layer.  Both from an implementation point of view and
from a "we can get it to work on things other than block devices" point
of view.

> Currently dm-ioband provides fairness in terms of number/size of IO.
> 
> Latencies and isolation between groups
> --------------------------------------
> An higher level controller is generally implementing a bandwidth throttling
> solution where if a group exceeds either the max bandwidth or the proportional
> share then throttle that group.
> 
> This kind of approach will probably not help in controlling latencies as it
> will depend on underlying IO scheduler. Consider following scenario. 
> 
> Assume there are two groups. One group is running multiple sequential readers
> and other group has a random reader. sequential readers will get a nice 100ms
> slice

Do you refer to each reader within group1, or to all readers?  It would be
daft if each reader in group1 were to get 100ms.

> each and then a random reader from group2 will get to dispatch the
> request. So latency of this random reader will depend on how many sequential
> readers are running in the other group, and that is weak isolation between groups.

And yet that is what you appear to mean.

But surely nobody would do that - the 100ms would be assigned to and
distributed amongst all readers in group1?

> When we control things at IO scheduler level, we assign one time slice to one
> group and then pick next entity to run. So effectively after one time slice
> (max 180ms, if prio 0 sequential reader is running), random reader in other
> group will get to run. Hence we achieve better isolation between groups as
> response time of a process in a different group is generally not dependent on
> the number of processes running in the competing group.

I don't understand why you're comparing this implementation with such
an obviously dumb competing design!
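To make the comparison concrete, the slice-per-group scheme described in the quoted text can be modelled in a few lines (a toy model, not the patchset's code); it shows why the random reader's worst-case wait is bounded by one group slice rather than by the number of competing readers:

```python
# Toy model: groups are serviced round-robin, one weighted time slice
# per *group*.  Readers inside a group share that group's slice, so
# adding readers to group1 never lengthens group2's wait.

from collections import deque

def schedule(groups, base_slice_ms=100, total_ms=1000):
    """groups: list of (name, weight) tuples.  Returns a dispatch
    timeline of (start_ms, group_name, slice_ms) entries."""
    rr = deque(groups)
    total_weight = sum(w for _, w in groups)
    timeline, t = [], 0
    while t < total_ms:
        name, weight = rr[0]
        rr.rotate(-1)  # move current group to the back of the queue
        slice_ms = base_slice_ms * weight * len(groups) // total_weight
        timeline.append((t, name, slice_ms))
        t += slice_ms
    return timeline
```

With two equal-weight groups the timeline strictly alternates, so a random reader in group2 waits at most one 100ms slice no matter how many sequential readers group1 contains.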

> So a higher level solution is most likely limited to only shaping bandwidth
> without any control on latencies.
> 
> Stacking group scheduler on top of CFQ can lead to issues
> ---------------------------------------------------------
> IO throttling and dm-ioband are both second-level controllers. That is, these
> controllers are implemented in higher layers than io schedulers. So they
> control the IO at higher layer based on group policies and later IO
> schedulers take care of dispatching these bios to disk.
> 
> Implementing a second level controller has the advantage of being able to
> provide bandwidth control even on logical block devices in the IO stack
> which don't have any IO schedulers attached to them. But they can also
> interfere with the IO scheduling policy of the underlying IO scheduler and change
> the effective behavior. Following are some of the issues which I think
> should be visible in second level controller in one form or other.
> 
>   Prio with-in group
>   ------------------
>   A second level controller can potentially interfere with the behavior of
>   different prio processes within a group. Bios are buffered at a higher layer
>   in a single queue and release of bios is FIFO and not proportionate to the
>   ioprio of the process. This can result in a particular prio level not
>   getting fair share.

That's an administrator error, isn't it?  Should have put the
different-priority processes into different groups.

>   Buffering at higher layer can delay read requests for more than slice idle
>   period of CFQ (default 8 ms). That means, it is possible that we are waiting
>   for a request from the queue but it is buffered at higher layer and then idle
>   timer will fire. It means that the queue will lose its share, and at the same time
>   overall throughput will be impacted as we lost those 8 ms.

That sounds like a bug.

>   Read Vs Write
>   -------------
>   Writes can overwhelm readers hence second level controller FIFO release
>   will run into issue here. If there is a single queue maintained then reads
>   will suffer large latencies. If there are separate queues for reads and writes
>   then it will be hard to decide in what ratio to dispatch reads and writes as
>   it is IO scheduler's decision to decide when and how much read/write to
>   dispatch. This is another place where higher level controller will not be in
>   sync with lower level io scheduler and can change the effective policies of
>   underlying io scheduler.

The IO schedulers already take care of read-vs-write and already take
care of preventing large writes-starve-reads latencies (or at least,
they're supposed to).

>   CFQ IO context Issues
>   ---------------------
>   Buffering at higher layer means submission of bios later with the help of
>   a worker thread.

Why?

If it's a read, we just block the userspace process.

If it's a delayed write, the IO submission already happens in a kernel thread.

If it's a synchronous write, we have to block the userspace caller
anyway.

Async reads might be an issue, dunno.

> This changes the io context information at CFQ layer which
>   assigns the request to the submitting thread. Change of io context info again
>   leads to issues of idle timer expiry and issue of a process not getting fair
>   share and reduced throughput.

But we already have that problem with delayed writeback, which is a
huge thing - often it's the majority of IO.

>   Throughput with noop, deadline and AS
>   ---------------------------------------------
>   I think a higher-level controller will result in reduced overall throughput
>   (as compared to io scheduler based io controller) and more seeks with noop,
>   deadline and AS.
> 
>   The reason being that it is likely that IO within a group will be related
>   and will be relatively close as compared to IO across the groups. For example,
>   thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
>   control, IO from various groups will go into a single queue at lower level
>   controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
>   G4....) causing more seeks and reduced throughput. (Agreed that merging will
>   help up to some extent but still....).
> 
>   Instead, in case of lower level controller, IO scheduler maintains one queue
>   per group hence there is no interleaving of IO between groups. And if IO is
>   related within a group, then we should get a reduced number of seeks and
>   higher throughput.
> 
>   Latency can be a concern but that can be controlled by reducing the time
>   slice length of the queue.

Well maybe, maybe not.  If a group is throttled, it isn't submitting
new IO.  The unthrottled group is doing the IO submitting and that IO
will have decent locality.

> Fairness at logical device level vs at physical device level
> ------------------------------------------------------------
> 
> IO scheduler based controller has the limitation that it works only with the
> bottommost devices in the IO stack where an IO scheduler is attached.
> 
> For example, assume a user has created a logical device lv0 using three
> underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> in two groups doing IO on lv0. Also assume that weights of groups are in the
> ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
> 
> 			     T1    T2
> 			       \   /
> 			        lv0
> 			      /  |  \
> 			    sda sdb  sdc
> 
> 
> Now resource control will take place only on devices sda, sdb and sdc and
> not at lv0 level. So if IO from two tasks is relatively uniformly
> distributed across the disks then T1 and T2 will see the throughput ratio
> in proportion to weight specified. But if IO from T1 and T2 is going to
> different disks and there is no contention then at higher level they both
> will see the same BW.
> 
> Here a second level controller can produce better fairness numbers at
> logical device but most likely at reduced overall throughput of the system,
> because it will try to control IO even if there is no contention at the physical
> level, possibly leaving disks unused in the system.
> 
> Hence, the question is how important it is to also control bandwidth at
> higher-level logical devices. The actual contention for resources is
> at the leaf block device, so it probably makes sense to do any kind of
> control there and not at the intermediate devices. Secondly, it probably
> also means better use of available resources.

hm.  What will be the effects of this limitation in real-world use?

> Limited Fairness
> ----------------
> Currently CFQ idles on a sequential reader queue to make sure it gets its
> fair share. A second level controller will find it tricky to anticipate.
> Either it will not have any anticipation logic and in that case it will not
> provide fairness to single readers in a group (as dm-ioband does) or if it
> starts anticipating then we should run into these strange situations where
> second level controller is anticipating on one queue/group and underlying
> IO scheduler might be anticipating on something else.

It depends on the size of the inter-group timeslices.  If the amount of
time for which a group is unthrottled is "large" compared to the
typical anticipation times, this issue fades away.

And those timeslices _should_ be large.  Because as you mentioned
above, different groups are probably working different parts of the
disk.
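A quick back-of-envelope check of that claim, using CFQ's default 8ms slice_idle:

```python
# If at most one idle window expires unused per group slice, the wasted
# fraction of the slice shrinks as the slice grows.

def worst_case_idle_overhead(group_slice_ms, slice_idle_ms=8):
    """Fraction of a group's slice lost to one expired idle window."""
    return slice_idle_ms / group_slice_ms

# 8ms inside a 100ms group slice wastes at most 8%; inside a 500ms
# slice, under 2%.
```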

> Need of device mapper tools
> ---------------------------
> A device mapper based solution will require creation of an ioband device
> on each physical/logical device one wants to control. So it requires usage
> of device mapper tools even for people who are not using device mapper.
> At the same time, creation of an ioband device on each partition in the system to
> control the IO can be cumbersome and overwhelming if the system has got lots of
> disks and partitions within.
> 
> 
> IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> problem of group bandwidth control, and can do hierarchical IO scheduling
> more tightly and efficiently.
> 
> But I am all ears to alternative approaches and suggestions on how things
> can be done better and will be glad to implement it.
> 
> TODO
> ====
> - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> - More testing to make sure there are no regressions in CFQ.
> 
> Testing
> =======
> 
> Environment
> ==========
> A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.

That's a bit of a toy.

Do we have testing results for more enterprisey hardware?  Big storage
arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)


> I am mostly
> running fio jobs which have been limited to 30-second runs, and monitoring
> the throughput and latency.
>  
> Test1: Random Reader Vs Random Writers
> ======================================
> Launched a random reader and then an increasing number of random writers to see
> the effect on random reader BW and max latencies.
> 
> [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> 
> [Vanilla CFQ, No groups]
> <--------------random writers-------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec   
> 2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec   
> 4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec   
> 8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec   
> 16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec   
> 32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec   
> 
> Created two cgroups, group1 and group2, of weight 500 each. Launched an
> increasing number of random writers in group1 and one random reader in group2
> using fio.
> 
> [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> <--------------random writers(group1)-------------> <-random reader(group2)->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec   
> 2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec   
> 4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec   
> 8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec   
> 16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec   
> 32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec   

That's a good result.

> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
> 
> [IO controller CFQ; No groups ]
> <--------------random writers-------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec   
> 2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec   
> 4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec   
> 8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec   
> 16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec   
> 32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec   
> 
> Notes:
> - With vanilla CFQ, random writers can overwhelm a random reader, bringing down
>   its throughput and bumping up latencies significantly.

Isn't that a CFQ shortcoming which we should address separately?  If
so, the comparisons aren't presently valid because we're comparing with
a CFQ which has known, should-be-fixed problems.

> - With the IO controller, one can provide isolation to the random reader group
>   and maintain a consistent view of bandwidth and latencies.
> 
> Test2: Random Reader Vs Sequential Reader
> ========================================
> Launched a random reader and then an increasing number of sequential readers to
> see the effect on BW and latencies of random reader.
> 
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> 
> [ Vanilla CFQ, No groups ]
> <---------------seq readers---------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec   
> 2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec   
> 4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec   
> 8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec   
> 16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec  
> 
> Created two cgroups, group1 and group2, of weight 500 each. Launched an
> increasing number of sequential readers in group1 and one random reader in
> group2 using fio.
> 
> [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> <---------------group1--------------------------->  <------group2--------->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec   
> 2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec   
> 4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec   
> 8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec   
> 16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec   
> 
> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
> 
> [IO controller CFQ; No groups ]
> <---------------seq readers---------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec   
> 2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec   
> 4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec   
> 8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec   
> 16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec  
> 
> Notes:
> - The BW and latencies of the random reader in group2 seem to be stable and
>   bounded and do not get impacted much as the number of sequential readers
>   increases in group1. Hence providing good isolation.
> 
> - Throughput of sequential readers comes down and latencies go up as half
>   of disk bandwidth (in terms of time) has been reserved for random reader
>   group.
> 
> Test3: Sequential Reader Vs Sequential Reader
> ============================================
> Created two cgroups, group1 and group2, of weights 500 and 1000 respectively.
> Launched an increasing number of sequential readers in group1 and one sequential
> reader in group2 using fio and monitored how bandwidth is being distributed
> between the two groups.
> 
> First 5 columns give stats about job in group1 and last two columns give
> stats about job in group2.
> 
> <---------------group1--------------------------->  <------group2--------->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec   
> 2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec   
> 4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec   
> 8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec   
> 16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec   
> 
> Note: group2 is getting double the bandwidth of group1 even in the face
> of increasing number of readers in group1.
> 
> Test4 (Isolation between two KVM virtual machines)
> ==================================================
> Created two KVM virtual machines. Partitioned a disk on host in two partitions
> and gave one partition to each virtual machine. Put both the virtual machines
> in two different cgroup of weight 1000 and 500 each. Virtual machines created
> ext3 file system on the partitions exported from host and did buffered writes.
> The host sees the writes as synchronous, and the virtual machine with higher
> weight gets double the disk time of the virtual machine with lower weight. Used
> the deadline scheduler in this test case.
> 
> Some more details about configuration are in documentation patch.
> 
> Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> ===================================================================
> Fairness for async writes is tricky and biggest reason is that async writes
> are cached in higher layers (page cache) as well as possibly in the file system
> layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> in proportional manner.
> 
> For example, consider two dd threads reading /dev/zero as input file and doing
> writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> be forced to write out some pages to disk before more pages can be dirtied. But
> the pages picked are not necessarily the dirty pages of the same thread. It can
> very well pick
> the inode of lesser priority dd thread and do some writeout. So effectively
> higher weight dd is doing writeouts of lower weight dd pages and we don't see
> service differentiation.
> 
> IOW, the core problem with buffered write fairness is that higher weight thread
> does not throw enough IO traffic at the IO controller to keep the queue
> continuously backlogged. In my testing, there are many .2 to .8 second
> intervals where higher weight queue is empty and in that duration lower weight
> queue get lots of job done giving the impression that there was no service
> differentiation.
> 
> In summary, from IO controller point of view async writes support is there.
> Because page cache has not been designed in such a manner that higher 
> prio/weight writer can do more write out as compared to lower prio/weight
> writer, getting service differentiation is hard and it is visible in some
> cases and not visible in some cases.

Here's where it all falls to pieces.

For async writeback we just don't care about IO priorities.  Because
from the point of view of the userspace task, the write was async!  It
occurred at memory bandwidth speed.

It's only when the kernel's dirty memory thresholds start to get
exceeded that we start to care about prioritisation.  And at that time,
all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
consumes just as much memory as a low-ioprio dirty page.

So when balance_dirty_pages() hits, what do we want to do?

I suppose that all we can do is to block low-ioprio processes more
aggressively at the VFS layer, to reduce the rate at which they're
dirtying memory so as to give high-ioprio processes more of the disk
bandwidth.

But you've gone and implemented all of this stuff at the io-controller
level and not at the VFS level so you're, umm, screwed.

Importantly screwed!  It's a very common workload pattern, and one
which causes tremendous amounts of IO to be generated very quickly,
traditionally causing bad latency effects all over the place.  And we
have no answer to this.
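A minimal sketch of what VFS-level prioritisation could look like, assuming per-task (or per-memcg) IO weights; everything here is hypothetical and is not how balance_dirty_pages() actually behaves:

```python
# Hypothetical model: give each writer a share of the global dirty
# threshold proportional to its IO weight, so a low-weight task hits
# its (smaller) limit sooner and blocks in the VFS more aggressively.

def dirty_limit_for(task_weight, total_weight, global_dirty_limit):
    """Per-task dirty-page threshold, proportional to IO weight."""
    return global_dirty_limit * task_weight // total_weight

def pages_to_write_back(task_dirty, task_weight, total_weight,
                        global_dirty_limit):
    """How many pages this task must write back (i.e. how long it is
    blocked) before it may dirty more memory."""
    limit = dirty_limit_for(task_weight, total_weight, global_dirty_limit)
    return max(0, task_dirty - limit)
```

Under this scheme the high-weight writer keeps dirtying at memory speed while the low-weight writer is the one stalled in the VFS, which is the service differentiation the quoted text could not get from the IO layer alone.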

> Vanilla CFQ Vs IO Controller CFQ
> ================================
> We have not fundamentally changed CFQ, instead enhanced it to also support
> hierarchical io scheduling. In the process invariably there are small changes
> here and there as new scenarios come up. Running some tests here and comparing
> both the CFQ's to see if there is any major deviation in behavior.
> 
> Test1: Sequential Readers
> =========================
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
> 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
> 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
> 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
> 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  
> 
> IO scheduler: IO controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
> 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
> 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
> 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
> 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  
> 
> Test2: Sequential Writers
> =========================
> [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
> 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
> 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
> 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
> 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  
> 
> IO scheduler: IO Controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
> 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
> 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
> 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
> 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  
> 
> Test3: Random Readers
> =========================
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
> 2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
> 4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
> 8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
> 16  38KiB/s     8KiB/s      328KiB/s    3965 msec   
> 
> IO scheduler: IO Controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
> 2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
> 4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
> 8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
> 16  43KiB/s     9KiB/s      327KiB/s    3905 msec   
> 
> Test4: Random Writers
> =====================
> [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
> 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
> 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
> 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
> 16  66KiB/s     22KiB/s     829KiB/s    1308 msec   
> 
> IO scheduler: IO Controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
> 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
> 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
> 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
> 16  71KiB/s     29KiB/s     814KiB/s    1457 msec   
> 
> Notes:
>  - Does not look like anything has changed significantly.
> 
> Previous versions of the patches were posted here.
> ------------------------------------------------
> 
> (V1) http://lkml.org/lkml/2009/3/11/486
> (V2) http://lkml.org/lkml/2009/5/5/275
> (V3) http://lkml.org/lkml/2009/5/26/472
> (V4) http://lkml.org/lkml/2009/6/8/580
> (V5) http://lkml.org/lkml/2009/6/19/279
> (V6) http://lkml.org/lkml/2009/7/2/369
> (V7) http://lkml.org/lkml/2009/7/24/253
> (V8) http://lkml.org/lkml/2009/8/16/204
> (V9) http://lkml.org/lkml/2009/8/28/327
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread

* Re: IO scheduler based IO controller V10
@ 2009-09-24 21:33   ` Andrew Morton
  0 siblings, 0 replies; 349+ messages in thread
From: Andrew Morton @ 2009-09-24 21:33 UTC (permalink / raw)
  Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
	paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer,
	nauman, mingo, vgoyal, m-ikeda, riel, lizf, fchecconi,
	containers, linux-kernel, s-uchida, righi.andrea, torvalds

On Thu, 24 Sep 2009 15:25:04 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:

> 
> Hi All,
> 
> Here is the V10 of the IO controller patches generated on top of 2.6.31.
> 

Thanks for the writeup.  It really helps and is most worthwhile for a
project of this importance, size and complexity.


>  
> What problem are we trying to solve
> ===================================
> Provide group IO scheduling feature in Linux along the lines of other resource
> controllers like cpu.
> 
> IOW, provide a facility so that a user can group applications using cgroups and
> control the amount of disk time/bandwidth received by a group based on its
> weight. 
> 
> How to solve the problem
> =========================
> 
> Different people have solved the issue differently. So far it looks
> like we have the following two core requirements when it comes to
> fairness at group level.
> 
> - Control bandwidth seen by groups.
> - Control on latencies when a request gets backlogged in group.
> 
> At least there are now three patchsets available (including this one).
> 
> IO throttling
> -------------
> This is a bandwidth controller which keeps track of IO rate of a group and
> throttles the process in the group if it exceeds the user specified limit.
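Mechanically, this kind of max-bandwidth control is a token bucket per group: tokens accrue at the configured rate, each bio spends tokens, and an empty bucket means the group is throttled. A generic sketch (not the actual io-throttle code):

```python
# Generic token bucket: refill by elapsed time, spend per bio.

class TokenBucket:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s      # configured max bandwidth
        self.capacity = burst_bytes       # allowed burst size
        self.tokens = burst_bytes         # start with a full bucket
        self.last = 0.0                   # timestamp of last refill

    def allow(self, now, io_bytes):
        """Refill for the elapsed time, then try to spend io_bytes.
        Returns True to dispatch the bio, False to delay it."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= io_bytes:
            self.tokens -= io_bytes
            return True
        return False
```

Note that, as argued elsewhere in this mail, a pure token bucket shapes bandwidth only; it says nothing about latency or about fairness in terms of disk time.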
> 
> dm-ioband
> ---------
> This is a proportional bandwidth controller implemented as device mapper
> driver and provides fair access in terms of amount of IO done (not in terms
> of disk time as CFQ does).
> 
> So one will set up one or more dm-ioband devices on top of a physical/logical
> block device, configure the ioband device and pass information like grouping
> etc. Now this device will keep track of bios flowing through it and control
> the flow of bios based on group policies.
> 
> IO scheduler based IO controller
> --------------------------------
> Here we have viewed the problem of the IO controller as a hierarchical group
> scheduling (along the lines of CFS group scheduling) issue. Currently one can
> view linux IO schedulers as flat where there is one root group and all the IO
> belongs to that group.
> 
> This patchset basically modifies IO schedulers to also support hierarchical
> group scheduling. CFQ already provides fairness among different processes. I
> have extended it to support group IO scheduling. Also took some of the code out
> of CFQ and put in a common layer so that same group scheduling code can be
> used by noop, deadline and AS to support group scheduling. 
> 
> Pros/Cons
> =========
> There are pros and cons to each of the approach. Following are some of the
> thoughts.
> 
> Max bandwidth vs proportional bandwidth
> ---------------------------------------
> IO throttling is a max bandwidth controller and not a proportional one.
> Additionally, it provides fairness in terms of amount of IO done (and not in
> terms of disk time as CFQ does).
> 
> Personally, I think that a proportional weight controller is useful to more
> people than just a max bandwidth controller. In addition, the IO scheduler based
> controller can also be enhanced to do max bandwidth control. So it can
> satisfy a wider set of requirements.
> 
> Fairness in terms of disk time vs size of IO
> ---------------------------------------------
> A higher-level controller will most likely be limited to providing fairness
> in terms of size/number of IO done and will find it hard to provide fairness
> in terms of disk time used (as CFQ provides between various prio levels). This
> is because only IO scheduler knows how much disk time a queue has used and
> information about queues and disk time used is not exported to higher
> layers.
> 
> So a seeky application will still run away with lot of disk time and bring
> down the overall throughput of the disk.

But that's only true if the thing is poorly implemented.

A high-level controller will need some view of the busyness of the
underlying device(s).  That could be "proportion of idle time", or
"average length of queue" or "average request latency" or some mix of
these or something else altogether.

But these things are simple to calculate, and are simple to feed back
to the higher-level controller and probably don't require any changes
to the IO scheduler at all, which is a great advantage.


And I must say that high-level throttling based upon feedback from
lower layers seems like a much better model to me than hacking away in
the IO scheduler layer.  Both from an implementation point of view and
from a "we can get it to work on things other than block devices" point
of view.

> Currently dm-ioband provides fairness in terms of number/size of IO.
> 
> Latencies and isolation between groups
> --------------------------------------
> A higher-level controller generally implements a bandwidth throttling
> solution where if a group exceeds either the max bandwidth or the proportional
> share then throttle that group.
> 
> This kind of approach will probably not help in controlling latencies as it
> will depend on the underlying IO scheduler. Consider the following scenario.
> 
> Assume there are two groups. One group is running multiple sequential readers
> and the other group has a random reader. Sequential readers will get a nice 100ms
> slice

Do you refer to each reader within group1, or to all readers?  It would be
daft if each reader in group1 were to get 100ms.

> each and then a random reader from group2 will get to dispatch the
> request. So latency of this random reader will depend on how many sequential
> readers are running in the other group, and that is weak isolation between groups.

And yet that is what you appear to mean.

But surely nobody would do that - the 100ms would be assigned to and
distributed amongst all readers in group1?

> When we control things at IO scheduler level, we assign one time slice to one
> group and then pick next entity to run. So effectively after one time slice
> (max 180ms, if prio 0 sequential reader is running), random reader in other
> group will get to run. Hence we achieve better isolation between groups as
> response time of a process in a different group is generally not dependent on
> the number of processes running in the competing group.

I don't understand why you're comparing this implementation with such
an obviously dumb competing design!

> So a higher level solution is most likely limited to only shaping bandwidth
> without any control on latencies.
> 
> Stacking group scheduler on top of CFQ can lead to issues
> ---------------------------------------------------------
> IO throttling and dm-ioband are both second-level controllers. That is, these
> controllers are implemented in higher layers than io schedulers. So they
> control the IO at higher layer based on group policies and later IO
> schedulers take care of dispatching these bios to disk.
> 
> Implementing a second level controller has the advantage of being able to
> provide bandwidth control even on logical block devices in the IO stack
> which don't have any IO schedulers attached to them. But they can also
> interfere with the IO scheduling policy of the underlying IO scheduler and change
> its effective behavior. Following are some of the issues which I think
> will be visible in a second level controller in one form or another.
> 
>   Prio with-in group
>   ------------------
>   A second level controller can potentially interfere with the behavior of
>   different prio processes within a group. Bios are buffered at the higher layer
>   in a single queue, and release of bios is FIFO rather than proportionate to the
>   ioprio of the process. This can result in a particular prio level not
>   getting its fair share.

That's an administrator error, isn't it?  Should have put the
different-priority processes into different groups.

>   Buffering at a higher layer can delay read requests for more than the slice idle
>   period of CFQ (default 8 ms). That means it is possible that we are waiting
>   for a request from the queue but it is buffered at the higher layer, and then the
>   idle timer will fire. The queue will lose its share, and at the same time
>   overall throughput will be impacted because we lost those 8 ms.

That sounds like a bug.

>   Read Vs Write
>   -------------
>   Writes can overwhelm readers, so a second level controller's FIFO release
>   will run into trouble here. If a single queue is maintained then reads
>   will suffer large latencies. If there are separate queues for reads and writes
>   then it will be hard to decide in what ratio to dispatch reads and writes, as
>   it is the IO scheduler's job to decide when and how much read/write to
>   dispatch. This is another place where a higher level controller will not be in
>   sync with the lower level io scheduler and can change the effective policies of
>   the underlying io scheduler.

The IO schedulers already take care of read-vs-write and already take
care of preventing large writes-starve-reads latencies (or at least,
they're supposed to).

>   CFQ IO context Issues
>   ---------------------
>   Buffering at a higher layer means bios are submitted later with the help of
>   a worker thread.

Why?

If it's a read, we just block the userspace process.

If it's a delayed write, the IO submission already happens in a kernel thread.

If it's a synchronous write, we have to block the userspace caller
anyway.

Async reads might be an issue, dunno.

> This changes the io context information at the CFQ layer, which
>   assigns the request to the submitting thread. The change of io context info again
>   leads to issues of idle timer expiry, a process not getting its fair
>   share, and reduced throughput.

But we already have that problem with delayed writeback, which is a
huge thing - often it's the majority of IO.

>   Throughput with noop, deadline and AS
>   ---------------------------------------------
>   I think a higher level controller will result in reduced overall throughput
>   (as compared to an io scheduler based io controller) and more seeks with noop,
>   deadline and AS.
> 
>   The reason being that it is likely that IO within a group will be related
>   and will be relatively close as compared to IO across groups. For example,
>   the thread pool of kvm-qemu doing IO for a virtual machine. In the case of higher
>   level control, IO from various groups will go into a single queue at the lower level
>   controller, and it might happen that IO is now interleaved (G1, G2, G1, G3,
>   G4....), causing more seeks and reduced throughput. (Agreed that merging will
>   help to some extent, but still....)
> 
>   Instead, in case of lower level controller, IO scheduler maintains one queue
>   per group hence there is no interleaving of IO between groups. And if IO is
>   related within a group, then we should get a reduced number/amount of seeks and
>   higher throughput.
> 
>   Latency can be a concern but that can be controlled by reducing the time
>   slice length of the queue.

Well maybe, maybe not.  If a group is throttled, it isn't submitting
new IO.  The unthrottled group is doing the IO submitting and that IO
will have decent locality.

> Fairness at logical device level vs at physical device level
> ------------------------------------------------------------
> 
> IO scheduler based controller has the limitation that it works only with the
> bottom most devices in the IO stack where IO scheduler is attached.
> 
> For example, assume a user has created a logical device lv0 using three
> underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> in two groups doing IO on lv0. Also assume that weights of groups are in the
> ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
> 
> 			     T1    T2
> 			       \   /
> 			        lv0
> 			      /  |  \
> 			    sda sdb  sdc
> 
> 
> Now resource control will take place only on devices sda, sdb and sdc and
> not at the lv0 level. So if IO from the two tasks is relatively uniformly
> distributed across the disks then T1 and T2 will see the throughput ratio
> in proportion to the specified weights. But if IO from T1 and T2 is going to
> different disks and there is no contention, then at the higher level they both
> will see the same BW.
> 
> Here a second level controller can produce better fairness numbers at the
> logical device, but most likely at reduced overall throughput of the system,
> because it will try to control IO even if there is no contention at the physical
> level, possibly leaving disks unused in the system.
> 
> Hence, the question becomes how important it is to control bandwidth at
> higher level logical devices as well. The actual contention for resources is
> at the leaf block device, so it probably makes sense to do any kind of
> control there and not at the intermediate devices. Secondly, it probably
> also means better use of available resources.

hm.  What will be the effects of this limitation in real-world use?

> Limited Fairness
> ----------------
> Currently CFQ idles on a sequential reader queue to make sure it gets its
> fair share. A second level controller will find it tricky to anticipate.
> Either it will have no anticipation logic, in which case it will not
> provide fairness to single readers in a group (as dm-ioband does), or if it
> starts anticipating then we run into strange situations where the
> second level controller is anticipating on one queue/group while the underlying
> IO scheduler is anticipating on something else.

It depends on the size of the inter-group timeslices.  If the amount of
time for which a group is unthrottled is "large" compared to the
typical anticipation times, this issue fades away.

And those timeslices _should_ be large.  Because as you mentioned
above, different groups are probably working different parts of the
disk.

> Need of device mapper tools
> ---------------------------
> A device mapper based solution will require creation of a ioband device
> on each physical/logical device one wants to control. So it requires usage
> of device mapper tools even for the people who are not using device mapper.
> At the same time, creating an ioband device on each partition in the system to
> control the IO can be cumbersome and overwhelming if the system has lots of
> disks and partitions.
> 
> 
> IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> problem of group bandwidth control, and can do hierarchical IO scheduling
> more tightly and efficiently.
> 
> But I am all ears to alternative approaches and suggestions how doing things
> can be done better and will be glad to implement it.
> 
> TODO
> ====
> - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> - More testing to make sure there are no regressions in CFQ.
> 
> Testing
> =======
> 
> Environment
> ==========
> A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.

That's a bit of a toy.

Do we have testing results for more enterprisey hardware?  Big storage
arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)


> I am mostly
> running fio jobs which have been limited to 30 second runs, and then monitored
> the throughput and latency.
>  
> Test1: Random Reader Vs Random Writers
> ======================================
> Launched a random reader and then increasing number of random writers to see
> the effect on random reader BW and max latencies.
> 
> [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> 
> [Vanilla CFQ, No groups]
> <--------------random writers-------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec   
> 2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec   
> 4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec   
> 8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec   
> 16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec   
> 32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec   
> 
> Created two cgroups group1 and group2 of weights 500 each.  Launched increasing
> number of random writers in group1 and one random reader in group2 using fio.
> 
> [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> <--------------random writers(group1)-------------> <-random reader(group2)->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec   
> 2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec   
> 4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec   
> 8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec   
> 16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec   
> 32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec   

That's a good result.

> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
> 
> [IO controller CFQ; No groups ]
> <--------------random writers-------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec   
> 2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec   
> 4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec   
> 8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec   
> 16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec   
> 32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec   
> 
> Notes:
> - With vanilla CFQ, random writers can overwhelm a random reader, bringing down
>   its throughput and bumping up latencies significantly.

Isn't that a CFQ shortcoming which we should address separately?  If
so, the comparisons aren't presently valid because we're comparing with
a CFQ which has known, should-be-fixed problems.

> - With IO controller, one can provide isolation to the random reader group and
>   maintain a consistent view of bandwidth and latencies.
> 
> Test2: Random Reader Vs Sequential Reader
> ========================================
> Launched a random reader and then increasing number of sequential readers to
> see the effect on BW and latencies of random reader.
> 
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> 
> [ Vanilla CFQ, No groups ]
> <---------------seq readers---------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec   
> 2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec   
> 4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec   
> 8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec   
> 16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec  
> 
> Created two cgroups group1 and group2 of weights 500 each.  Launched increasing
> number of sequential readers in group1 and one random reader in group2 using
> fio.
> 
> [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> <---------------group1--------------------------->  <------group2--------->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec   
> 2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec   
> 4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec   
> 8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec   
> 16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec   
> 
> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
> 
> [IO controller CFQ; No groups ]
> <---------------seq readers---------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec   
> 2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec   
> 4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec   
> 8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec   
> 16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec  
> 
> Notes:
> - The BW and latencies of the random reader in group2 seem to be stable and
>   bounded and do not get impacted much as the number of sequential readers
>   increases in group1, hence providing good isolation.
> 
> - Throughput of sequential readers comes down and latencies go up as half
>   of disk bandwidth (in terms of time) has been reserved for random reader
>   group.
> 
> Test3: Sequential Reader Vs Sequential Reader
> ============================================
> Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> Launched increasing number of sequential readers in group1 and one sequential
> reader in group2 using fio and monitored how bandwidth is being distributed
> between two groups.
> 
> First 5 columns give stats about job in group1 and last two columns give
> stats about job in group2.
> 
> <---------------group1--------------------------->  <------group2--------->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
> 1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec   
> 2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec   
> 4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec   
> 8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec   
> 16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec   
> 
> Note: group2 is getting double the bandwidth of group1 even in the face
> of increasing number of readers in group1.
> 
> Test4 (Isolation between two KVM virtual machines)
> ==================================================
> Created two KVM virtual machines. Partitioned a disk on host in two partitions
> and gave one partition to each virtual machine. Put both the virtual machines
> in two different cgroup of weight 1000 and 500 each. Virtual machines created
> ext3 file system on the partitions exported from host and did buffered writes.
> The host sees the writes as synchronous, and the virtual machine with the higher
> weight gets double the disk time of the virtual machine with the lower weight.
> Used the deadline scheduler in this test case.
> 
> Some more details about configuration are in documentation patch.
> 
> Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> ===================================================================
> Fairness for async writes is tricky, and the biggest reason is that async writes
> are cached in higher layers (page cache) and possibly in the file system
> layer as well (btrfs, xfs etc.), and are dispatched to lower layers not necessarily
> in a proportional manner.
> 
> For example, consider two dd threads reading /dev/zero as the input file and doing
> writes of huge files. Very soon we will cross vm_dirty_ratio and a dd thread will
> be forced to write out some pages to disk before more pages can be dirtied. But
> not necessarily are the dirty pages of the same thread picked. Writeback can very
> well pick the inode of the lower priority dd thread and do some writeout. So
> effectively the higher weight dd is doing writeouts of the lower weight dd's
> pages, and we don't see service differentiation.
> 
> IOW, the core problem with buffered write fairness is that the higher weight thread
> does not throw enough IO traffic at the IO controller to keep the queue
> continuously backlogged. In my testing, there are many 0.2 to 0.8 second
> intervals where the higher weight queue is empty, and in that time the lower weight
> queue gets lots of work done, giving the impression that there was no service
> differentiation.
> 
> In summary, from the IO controller's point of view async write support is there.
> But because the page cache has not been designed so that a higher
> prio/weight writer can do more writeout than a lower prio/weight
> writer, getting service differentiation is hard, and it is visible in some
> cases and not in others.

Here's where it all falls to pieces.

For async writeback we just don't care about IO priorities.  Because
from the point of view of the userspace task, the write was async!  It
occurred at memory bandwidth speed.

It's only when the kernel's dirty memory thresholds start to get
exceeded that we start to care about prioritisation.  And at that time,
all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
consumes just as much memory as a low-ioprio dirty page.

So when balance_dirty_pages() hits, what do we want to do?

I suppose that all we can do is to block low-ioprio processes more
aggressively at the VFS layer, to reduce the rate at which they're
dirtying memory so as to give high-ioprio processes more of the disk
bandwidth.
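
One hedged sketch of what such VFS-level throttling could look like (purely
illustrative; no such mechanism exists in balance_dirty_pages() today, and the
function name here is invented): scale each task's effective dirty threshold by
its ioprio, so low-priority writers hit the throttle earlier.

```c
/*
 * Illustrative only: scale a per-task dirty threshold by ioprio so that
 * lower-priority writers get blocked by dirty throttling sooner.
 * ioprio 0 (highest) keeps the full limit; ioprio 7 gets half of it.
 * This is a sketch of the idea discussed above, not real kernel code.
 */
unsigned long prio_dirty_limit(unsigned long base_limit, int ioprio)
{
	/* map ioprio 0..7 linearly onto 100%..50% of the base limit */
	return base_limit - base_limit * (unsigned long)ioprio / 14;
}
```

With such a knob, a prio-7 dirtier would stall at half the dirty-page budget of
a prio-0 dirtier, indirectly shaping how much writeback traffic each generates.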

But you've gone and implemented all of this stuff at the io-controller
level and not at the VFS level so you're, umm, screwed.

Importantly screwed!  It's a very common workload pattern, and one
which causes tremendous amounts of IO to be generated very quickly,
traditionally causing bad latency effects all over the place.  And we
have no answer to this.

> Vanilla CFQ Vs IO Controller CFQ
> ================================
> We have not fundamentally changed CFQ, instead enhanced it to also support
> hierarchical io scheduling. In the process invariably there are small changes
> here and there as new scenarios come up. Running some tests here and comparing
> both the CFQ's to see if there is any major deviation in behavior.
> 
> Test1: Sequential Readers
> =========================
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
> 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
> 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
> 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
> 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  
> 
> IO scheduler: IO controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
> 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
> 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
> 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
> 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  
> 
> Test2: Sequential Writers
> =========================
> [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
> 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
> 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
> 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
> 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  
> 
> IO scheduler: IO Controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
> 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
> 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
> 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
> 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  
> 
> Test3: Random Readers
> =========================
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
> 2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
> 4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
> 8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
> 16  38KiB/s     8KiB/s      328KiB/s    3965 msec   
> 
> IO scheduler: IO Controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
> 2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
> 4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
> 8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
> 16  43KiB/s     9KiB/s      327KiB/s    3905 msec   
> 
> Test4: Random Writers
> =====================
> [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> 
> IO scheduler: Vanilla CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
> 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
> 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
> 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
> 16  66KiB/s     22KiB/s     829KiB/s    1308 msec   
> 
> IO scheduler: IO Controller CFQ
> 
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
> 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
> 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
> 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
> 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
> 16  71KiB/s     29KiB/s     814KiB/s    1457 msec   
> 
> Notes:
>  - Does not look like anything has changed significantly.
> 
> Previous versions of the patches were posted here.
> ------------------------------------------------
> 
> (V1) http://lkml.org/lkml/2009/3/11/486
> (V2) http://lkml.org/lkml/2009/5/5/275
> (V3) http://lkml.org/lkml/2009/5/26/472
> (V4) http://lkml.org/lkml/2009/6/8/580
> (V5) http://lkml.org/lkml/2009/6/19/279
> (V6) http://lkml.org/lkml/2009/7/2/369
> (V7) http://lkml.org/lkml/2009/7/24/253
> (V8) http://lkml.org/lkml/2009/8/16/204
> (V9) http://lkml.org/lkml/2009/8/28/327
> 
> Thanks
> Vivek


* IO scheduler based IO controller V10
@ 2009-09-24 19:25 Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-24 19:25 UTC (permalink / raw)
  To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b


Hi All,

Here is the V10 of the IO controller patches generated on top of 2.6.31.

For ease of patching, a consolidated patch is available here.

http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v10.patch

Changes from V9
===============
- Brought back the mechanism of idle trees (a cache of recently served io
  queues). BFQ had originally implemented it and I had gotten rid of it. Later
  I realized that it helps provide fairness when io queues and io groups are
  running at the same level. Hence brought the mechanism back.

  This cache helps in determining whether a task getting back into the tree
  is a streaming reader who just consumed a full slice length, a new process
  (if not in the cache), or a random reader who just got a small slice length and
  now got backlogged again.
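
The classification this cache enables can be sketched roughly as follows (an
illustrative model with invented names and thresholds, not the actual BFQ/CFQ
code):

```c
#include <stddef.h>

/*
 * Hypothetical sketch: classify a queue that just became backlogged
 * again, using a cache ("idle tree") of recently served queues.  A
 * queue found in the cache that consumed (most of) its allocated slice
 * is treated as a streaming reader; one that used only a small part of
 * its slice is treated as a seeky/random reader; a queue not in the
 * cache at all is treated as new.  The 3/4 threshold is invented.
 */
enum ioq_class { IOQ_NEW, IOQ_STREAMING, IOQ_SEEKY };

struct ioq {
	int in_idle_tree;		/* cached in the idle tree? */
	unsigned int slice_used;	/* ms of the last slice consumed */
	unsigned int slice_alloc;	/* ms of slice that was allocated */
};

enum ioq_class classify_returning_queue(const struct ioq *q)
{
	if (!q->in_idle_tree)
		return IOQ_NEW;
	/* consumed at least 3/4 of the allocated slice: streaming reader */
	if (q->slice_used * 4 >= q->slice_alloc * 3)
		return IOQ_STREAMING;
	return IOQ_SEEKY;
}
```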

- Implemented "wait busy" for sequential reader queues. We wait for one
  extra idle period for these queues to become busy again so that the group does
  not lose fairness. This works even if group_idle=0.

- Fixed an issue where readers don't preempt writers within a group when
  readers get backlogged. (Implemented late preemption.)

- Fixed the issue reported by Gui where Anticipatory was not expiring the
  queue.

- Made more modifications to AS so that it lets the common layer know that it is
  anticipating the next request, and the common fair queuing layer does not try
  to do excessive queue expirations.

- Started charging the queue only for the allocated slice length (if fairness
  is not set) when it consumed more than the allocated slice. Otherwise that
  queue can miss a dispatch round, doubling the max latencies. This idea was
  also borrowed from BFQ.

- Allowed preemption where a reader can preempt a writer running in a
  sibling group, or a metadata reader can preempt a non-metadata
  reader in a sibling group.

- Fixed freed_request() issue pointed out by Nauman.
 
What problem are we trying to solve
===================================
Provide group IO scheduling feature in Linux along the lines of other resource
controllers like cpu.

IOW, provide a facility so that a user can group applications using cgroups and
control the amount of disk time/bandwidth received by a group based on its
weight. 

How to solve the problem
=========================

Different people have solved the issue differently. So far it looks
like we have the following two core requirements when it comes to
fairness at the group level.

- Control bandwidth seen by groups.
- Control on latencies when a request gets backlogged in group.

At least there are now three patchsets available (including this one).

IO throttling
-------------
This is a max bandwidth controller which keeps track of the IO rate of a group
and throttles the processes in the group if it exceeds the user specified limit.
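
In rough terms, such a controller behaves like a per-group token bucket. This is
a hypothetical sketch with invented names, not io-throttle's actual code:

```c
/*
 * Toy per-group token bucket, illustrating max-bandwidth throttling:
 * tokens (bytes) accumulate at the configured rate; a bio may only be
 * dispatched if enough tokens are available, otherwise the group is
 * throttled.  All names and the burst policy are invented.
 */
struct iot_group {
	unsigned long rate_bps;		/* configured max bandwidth, bytes/sec */
	unsigned long tokens;		/* bytes the group may dispatch now */
	unsigned long max_burst;	/* cap on accumulated tokens */
};

/* refill tokens for `ms` elapsed milliseconds */
void iot_refill(struct iot_group *g, unsigned long ms)
{
	g->tokens += g->rate_bps * ms / 1000;
	if (g->tokens > g->max_burst)
		g->tokens = g->max_burst;
}

/* returns 1 if a bio of `bytes` may be dispatched now, 0 if throttled */
int iot_may_dispatch(struct iot_group *g, unsigned long bytes)
{
	if (g->tokens < bytes)
		return 0;
	g->tokens -= bytes;
	return 1;
}
```

Note that nothing in this scheme knows about disk time or seek cost, which is
exactly the "size of IO vs disk time" limitation discussed below.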

dm-ioband
---------
This is a proportional bandwidth controller implemented as device mapper
driver and provides fair access in terms of amount of IO done (not in terms
of disk time as CFQ does).

One sets up one or more dm-ioband devices on top of a physical/logical
block device, configures the ioband device and passes information like grouping
etc. This device will then keep track of bios flowing through it and control
the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here we have viewed the problem of the IO controller as a hierarchical group
scheduling (along the lines of CFS group scheduling) issue. Currently one can
view Linux IO schedulers as flat, where there is one root group and all the IO
belongs to that group.

This patchset basically modifies IO schedulers to also support hierarchical
group scheduling. CFQ already provides fairness among different processes. I
have extended it to support group IO scheduling. I also took some of the code
out of CFQ and put it in a common layer so that the same group scheduling code
can be used by noop, deadline and AS to support group scheduling.

Pros/Cons
=========
There are pros and cons to each of the approach. Following are some of the
thoughts.

Max bandwidth vs proportional bandwidth
---------------------------------------
IO throttling is a max bandwidth controller and not a proportional one.
Additionally it provides fairness in terms of the amount of IO done (and not in
terms of disk time as CFQ does).

Personally, I think that a proportional weight controller is useful to more
people than just a max bandwidth controller. In addition, the IO scheduler based
controller can also be enhanced to do max bandwidth control, so it can
satisfy a wider set of requirements.

Fairness in terms of disk time vs size of IO
---------------------------------------------
A higher level controller will most likely be limited to providing fairness
in terms of the size/number of IOs done and will find it hard to provide fairness
in terms of disk time used (as CFQ provides between various prio levels). This
is because only the IO scheduler knows how much disk time a queue has used, and
information about queues and disk time used is not exported to higher
layers.

So a seeky application will still run away with a lot of disk time and bring
down the overall throughput of the disk.

Currently dm-ioband provides fairness in terms of number/size of IO.

Latencies and isolation between groups
--------------------------------------
A higher level controller generally implements a bandwidth throttling
solution: if a group exceeds either the max bandwidth or the proportional
share, throttle that group.

This kind of approach will probably not help in controlling latencies, as that
will depend on the underlying IO scheduler. Consider the following scenario.

Assume there are two groups. One group is running multiple sequential readers
and the other group has a random reader. The sequential readers will get a nice
100ms slice each, and then the random reader from group2 will get to dispatch a
request. So the latency of this random reader will depend on how many sequential
readers are running in the other group, and that is weak isolation between groups.

When we control things at the IO scheduler level, we assign one time slice to one
group and then pick the next entity to run. So effectively after one time slice
(max 180ms, if a prio 0 sequential reader is running), the random reader in the
other group will get to run. Hence we achieve better isolation between groups, as
the response time of a process in a different group is generally not dependent on
the number of processes running in the competing group.

So a higher level solution is most likely limited to only shaping bandwidth
without any control on latencies.

Stacking group scheduler on top of CFQ can lead to issues
---------------------------------------------------------
IO throttling and dm-ioband are both second level controllers. That is, these
controllers are implemented in higher layers than the io schedulers. So they
control the IO at a higher layer based on group policies, and later the IO
schedulers take care of dispatching these bios to disk.

Implementing a second level controller has the advantage of being able to
provide bandwidth control even on logical block devices in the IO stack
which don't have any IO schedulers attached to them. But they can also
interfere with the IO scheduling policy of the underlying IO scheduler and change
its effective behavior. Following are some of the issues which I think
will be visible in a second level controller in one form or another.

  Prio with-in group
  ------------------
  A second level controller can potentially interfere with the behavior of
  different prio processes within a group. Bios are buffered at the higher layer
  in a single queue, and release of bios is FIFO rather than proportionate to the
  ioprio of the process. This can result in a particular prio level not
  getting its fair share.

  Buffering at the higher layer can delay read requests for more than the
  slice idle period of CFQ (default 8 ms). That means it is possible that we
  are waiting for a request from the queue while it is buffered at the higher
  layer, and then the idle timer will fire. The queue will lose its share,
  and at the same time overall throughput will be impacted because we lost
  those 8 ms.
  
  Read Vs Write
  -------------
  Writes can overwhelm readers, so a second-level controller's FIFO release
  will run into issues here. If a single queue is maintained, then reads will
  suffer large latencies. If there are separate queues for reads and writes,
  then it will be hard to decide in what ratio to dispatch them, as it is the
  IO scheduler's decision when and how much read/write to dispatch. This is
  another place where the higher-level controller will not be in sync with
  the lower-level IO scheduler and can change the effective policies of the
  underlying IO scheduler.

  CFQ IO context Issues
  ---------------------
  Buffering at the higher layer means bios are submitted later with the help
  of a worker thread. This changes the io context information at the CFQ
  layer, which assigns the request to the submitting thread. The changed io
  context info again leads to idle timer expiry issues and to a process not
  getting its fair share, with reduced throughput.

  Throughput with noop, deadline and AS
  ---------------------------------------------
  I think a higher-level controller will result in reduced overall
  throughput (as compared to an IO scheduler based controller) and more
  seeks with noop, deadline and AS.

  The reason is that IO within a group is likely to be related and
  relatively close together, as compared to IO across groups; for example,
  the thread pool of kvm-qemu doing IO for a virtual machine. With
  higher-level control, IO from various groups goes into a single queue at
  the lower-level controller, and the IO may end up interleaved (G1, G2, G1,
  G3, G4....), causing more seeks and reduced throughput. (Agreed that
  merging will help to some extent, but still....)

  Instead, with a lower-level controller, the IO scheduler maintains one
  queue per group, so there is no interleaving of IO between groups. And if
  IO is related within a group, then we should get fewer seeks and higher
  throughput.

  Latency can be a concern but that can be controlled by reducing the time
  slice length of the queue.
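  The seek-amplification argument above can be made concrete with a toy
  dispatch model (hypothetical LBA streams, nothing from the actual
  patches): count how often consecutive dispatched blocks are not adjacent.

```python
# Toy model: two groups each doing sequential IO in their own disk
# region.  FIFO release at a higher layer interleaves them; a per-group
# queue at the IO scheduler dispatches one group's run at a time.
def count_seeks(lbas):
    """Number of non-contiguous jumps in a dispatch stream."""
    return sum(1 for a, b in zip(lbas, lbas[1:]) if b != a + 1)

g1 = list(range(1000, 1008))   # group1: sequential in one region
g2 = list(range(9000, 9008))   # group2: sequential in another region

interleaved = [x for pair in zip(g1, g2) for x in pair]  # G1,G2,G1,G2...
per_group   = g1 + g2                                    # one slice each

print(count_seeks(interleaved))  # 15: every dispatch is a seek
print(count_seeks(per_group))    # 1: a single seek between groups
```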

Fairness at logical device level vs at physical device level
------------------------------------------------------------

IO scheduler based controller has the limitation that it works only with the
bottom most devices in the IO stack where IO scheduler is attached.

For example, assume a user has created a logical device lv0 using three
underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
in two groups doing IO on lv0. Also assume that weights of groups are in the
ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.

			     T1    T2
			       \   /
			        lv0
			      /  |  \
			    sda sdb  sdc


Now resource control will take place only on devices sda, sdb and sdc and
not at the lv0 level. So if IO from the two tasks is relatively uniformly
distributed across the disks, then T1 and T2 will see throughput in
proportion to the specified weights. But if IO from T1 and T2 goes to
different disks and there is no contention, then at the higher level they
will both see the same BW.
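Some toy arithmetic makes the point (all numbers hypothetical; weights 2:1
as above, each disk assumed to deliver 90 units of bandwidth):

```python
# Toy model: proportional control happens per physical disk only, so the
# 2:1 weight ratio shows up at lv0 only where both tasks contend for the
# same disks.  Bandwidth numbers are made up for illustration.
def lv0_bw(t1_disks, t2_disks, disk_bw=90, w1=2, w2=1):
    t1 = t2 = 0
    for disk in ('sda', 'sdb', 'sdc'):
        c1, c2 = disk in t1_disks, disk in t2_disks
        if c1 and c2:                      # contention: split by weight
            t1 += disk_bw * w1 // (w1 + w2)
            t2 += disk_bw * w2 // (w1 + w2)
        elif c1:
            t1 += disk_bw                  # no contention: full disk BW
        elif c2:
            t2 += disk_bw
    return t1, t2

# IO spread uniformly: both tasks touch all disks -> 2:1 as specified.
print(lv0_bw({'sda', 'sdb', 'sdc'}, {'sda', 'sdb', 'sdc'}))  # (180, 90)
# IO goes to different disks: no contention -> same BW at lv0.
print(lv0_bw({'sda'}, {'sdb'}))                              # (90, 90)
```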

Here a second-level controller can produce better fairness numbers at the
logical device, but most likely at reduced overall throughput of the system,
because it will try to control IO even when there is no contention at the
physical device, possibly leaving disks in the system unused.

Hence, the question becomes how important it is to control bandwidth at
higher-level logical devices as well. The actual contention for resources is
at the leaf block device, so it probably makes sense to do any kind of
control there and not at the intermediate devices. It probably also means
better use of available resources.

Limited Fairness
----------------
Currently CFQ idles on a sequential reader queue to make sure it gets its
fair share. A second-level controller will find it tricky to anticipate.
Either it has no anticipation logic, in which case it will not provide
fairness to a single reader in a group (as dm-ioband does), or if it starts
anticipating, then we run into strange situations where the second-level
controller is anticipating on one queue/group while the underlying IO
scheduler is anticipating on something else.

Need of device mapper tools
---------------------------
A device mapper based solution will require creation of an ioband device on
each physical/logical device one wants to control. So it requires use of
device mapper tools even for people who are not using device mapper. At the
same time, creating an ioband device on each partition in the system to
control the IO can be cumbersome and overwhelming if the system has lots of
disks and partitions.


IMHO, IO scheduler based IO controller is a reasonable approach to solve the
problem of group bandwidth control, and can do hierarchical IO scheduling
more tightly and efficiently.

But I am all ears to alternative approaches and suggestions on how things
can be done better, and will be glad to implement them.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Testing
=======

Environment
==========
A 7200 RPM SATA drive with a queue depth of 31, ext3 filesystem. I am mostly
running fio jobs limited to 30-second runs, monitoring throughput and
latency.
 
Test1: Random Reader Vs Random Writers
======================================
Launched a random reader and then an increasing number of random writers to
see the effect on random reader BW and max latencies.

[fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]

[Vanilla CFQ, No groups]
<--------------random writers-------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec   
2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec   
4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec   
8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec   
16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec   
32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec   

Created two cgroups, group1 and group2, with weight 500 each. Launched an
increasing number of random writers in group1 and one random reader in
group2 using fio.

[IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
<--------------random writers(group1)-------------> <-random reader(group2)->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec   
2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec   
4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec   
8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec   
16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec   
32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec   

Also ran the same test with IO controller CFQ in flat mode to see if there
are any major deviations from vanilla CFQ. There do not appear to be any.

[IO controller CFQ; No groups ]
<--------------random writers-------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec   
2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec   
4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec   
8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec   
16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec   
32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec   

Notes:
- With vanilla CFQ, random writers can overwhelm a random reader, bringing
  down its throughput and bumping up latencies significantly.

- With the IO controller, one can provide isolation to the random reader
  group and maintain a consistent view of bandwidth and latencies.

Test2: Random Reader Vs Sequential Reader
========================================
Launched a random reader and then an increasing number of sequential readers
to see the effect on the random reader's BW and latencies.

[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]

[ Vanilla CFQ, No groups ]
<---------------seq readers---------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec   
2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec   
4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec   
8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec   
16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec  

Created two cgroups, group1 and group2, with weight 500 each. Launched an
increasing number of sequential readers in group1 and one random reader in
group2 using fio.

[IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
<---------------group1--------------------------->  <------group2--------->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec   
2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec   
4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec   
8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec   
16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec   

Also ran the same test with IO controller CFQ in flat mode to see if there
are any major deviations from vanilla CFQ. There do not appear to be any.

[IO controller CFQ; No groups ]
<---------------seq readers---------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec   
2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec   
4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec   
8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec   
16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec  

Notes:
- The BW and latencies of the random reader in group2 are stable and bounded
  and do not get impacted much as the number of sequential readers in group1
  increases. Hence it provides good isolation.

- Throughput of sequential readers comes down and latencies go up as half
  of disk bandwidth (in terms of time) has been reserved for random reader
  group.

Test3: Sequential Reader Vs Sequential Reader
============================================
Created two cgroups, group1 and group2, with weights 500 and 1000
respectively. Launched an increasing number of sequential readers in group1
and one sequential reader in group2 using fio, and monitored how bandwidth
is distributed between the two groups.

First 5 columns give stats about job in group1 and last two columns give
stats about job in group2.

<---------------group1--------------------------->  <------group2--------->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec   
2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec   
4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec   
8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec   
16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec   

Note: group2 gets double the bandwidth of group1 even in the face of an
increasing number of readers in group1.

Test4 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on the host into two
partitions and gave one partition to each virtual machine. Put the two
virtual machines in two different cgroups with weights 1000 and 500
respectively. The virtual machines created ext3 filesystems on the
partitions exported from the host and did buffered writes. The host sees the
writes as synchronous, and the virtual machine with the higher weight gets
double the disk time of the virtual machine with the lower weight. Used the
deadline scheduler in this test case.

Some more details about configuration are in documentation patch.

Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (page cache) and possibly in the
filesystem layer as well (btrfs, xfs etc.), and are not necessarily
dispatched to lower layers in a proportional manner.

For example, consider two dd threads reading /dev/zero as the input file and
writing huge files. Very soon we will cross vm_dirty_ratio, and the dd
threads will be forced to write out some pages to disk before more pages can
be dirtied. But the dirty pages picked are not necessarily those of the same
thread; writeback can very well pick the inode of the lower-priority dd
thread and do some writeout. So effectively the higher-weight dd is doing
writeout of the lower-weight dd's pages, and we don't see service
differentiation.

IOW, the core problem with buffered write fairness is that the higher-weight
thread does not throw enough IO traffic at the IO controller to keep its
queue continuously backlogged. In my testing, there are many 0.2 to 0.8
second intervals where the higher-weight queue is empty, and in that
duration the lower-weight queue gets lots of work done, giving the
impression that there was no service differentiation.

In summary, from the IO controller's point of view, async write support is
there. But because the page cache is not designed so that a higher
prio/weight writer can do more writeout than a lower prio/weight writer,
getting service differentiation is hard; it is visible in some cases and not
in others.

Vanilla CFQ Vs IO Controller CFQ
================================
We have not fundamentally changed CFQ; instead we enhanced it to also
support hierarchical IO scheduling. In the process there are invariably
small changes here and there as new scenarios come up. Running some tests
here and comparing the two CFQs to see if there is any major deviation in
behavior.

Test1: Sequential Readers
=========================
[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  

IO scheduler: IO controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  

Test2: Sequential Writers
=========================
[fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  

Test3: Random Readers
=========================
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
16  38KiB/s     8KiB/s      328KiB/s    3965 msec   

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
16  43KiB/s     9KiB/s      327KiB/s    3905 msec   

Test4: Random Writers
=====================
[fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
16  66KiB/s     22KiB/s     829KiB/s    1308 msec   

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
16  71KiB/s     29KiB/s     814KiB/s    1457 msec   

Notes:
 - Does not look like anything has changed significantly.

Previous versions of the patches were posted here.
------------------------------------------------

(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279
(V6) http://lkml.org/lkml/2009/7/2/369
(V7) http://lkml.org/lkml/2009/7/24/253
(V8) http://lkml.org/lkml/2009/8/16/204
(V9) http://lkml.org/lkml/2009/8/28/327

Thanks
Vivek


* IO scheduler based IO controller V10
@ 2009-09-24 19:25 Vivek Goyal
  2009-09-24 21:33   ` Andrew Morton
                   ` (3 more replies)
  0 siblings, 4 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-24 19:25 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel


Hi All,

Here is the V10 of the IO controller patches generated on top of 2.6.31.

For ease of patching, a consolidated patch is available here.

http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v10.patch

Changes from V9
===============
- Brought back the mechanism of idle trees (a cache of recently served io
  queues). BFQ had originally implemented it and I had removed it. Later I
  realized that it helps provide fairness when io queues and io groups are
  running at the same level, hence brought the mechanism back.

  This cache helps in determining whether a task getting back into the tree
  is a streaming reader who just consumed a full slice length, a new process
  (if not in the cache), or a random reader who just got a small slice
  length and now got backlogged again.

- Implemented "wait busy" for sequential reader queues. So we wait for one
  extra idle period for these queues to become busy so that group does not
  loose fairness. This works even if group_idle=0.

- Fixed an issue where readers don't preempt writers within a group when
  readers get backlogged (implemented late preemption).

- Fixed the issue reported by Gui where Anticipatory was not expiring the
  queue.

- Did more modification to AS so that it lets the common layer know that it
  is anticipating on the next request, and the common fair queuing layer
  does not try to do excessive queue expirations.

- Started charging the queue only for the allocated slice length (if
  fairness is not set) when it consumed more than the allocated slice.
  Otherwise that queue can miss a dispatch round, doubling the max
  latencies. This idea is also borrowed from BFQ.

- Allowed preemption where a reader can preempt a writer running in a
  sibling group, or a metadata reader can preempt a non-metadata reader in a
  sibling group.

- Fixed freed_request() issue pointed out by Nauman.
 
What problem are we trying to solve
===================================
Provide group IO scheduling feature in Linux along the lines of other resource
controllers like cpu.

IOW, provide a facility so that a user can group applications using cgroups
and control the amount of disk time/bandwidth received by a group based on
its weight.

How to solve the problem
=========================

Different people have solved the issue differently. So far it looks like we
have the following two core requirements when it comes to fairness at the
group level.

- Control bandwidth seen by groups.
- Control on latencies when a request gets backlogged in group.

At least three patchsets are now available (including this one).

IO throttling
-------------
This is a bandwidth controller which keeps track of the IO rate of a group
and throttles the processes in the group if it exceeds the user-specified
limit.

dm-ioband
---------
This is a proportional bandwidth controller implemented as a device mapper
driver, providing fair access in terms of amount of IO done (not in terms of
disk time, as CFQ does).

So one sets up one or more dm-ioband devices on top of a physical/logical
block device, configures the ioband device and passes in information like
grouping etc. The device then keeps track of bios flowing through it and
controls the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here we have viewed the problem of the IO controller as a hierarchical group
scheduling issue (along the lines of CFS group scheduling). Currently one
can view Linux IO schedulers as flat, where there is one root group and all
the IO belongs to that group.

This patchset modifies the IO schedulers to also support hierarchical group
scheduling. CFQ already provides fairness among different processes; I have
extended it to support group IO scheduling. I also took some of the code out
of CFQ and put it in a common layer, so that the same group scheduling code
can be used by noop, deadline and AS.

Pros/Cons
=========
There are pros and cons to each approach. Following are some thoughts.

Max bandwidth vs proportional bandwidth
---------------------------------------
IO throttling is a max bandwidth controller, not a proportional one.
Additionally, it provides fairness in terms of amount of IO done (and not in
terms of disk time, as CFQ does).

Personally, I think a proportional weight controller is useful to more
people than just a max bandwidth controller. In addition, an IO scheduler
based controller can also be enhanced to do max bandwidth control, so it can
satisfy a wider set of requirements.

Fairness in terms of disk time vs size of IO
---------------------------------------------
A higher-level controller will most likely be limited to providing fairness
in terms of size/number of IOs done, and will find it hard to provide
fairness in terms of disk time used (as CFQ provides between various prio
levels). This is because only the IO scheduler knows how much disk time a
queue has used, and that information is not exported to higher layers.

So a seeky application will still run away with a lot of disk time and bring
down the overall throughput of the disk.
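Some back-of-the-envelope arithmetic (throughput numbers assumed, in the
ballpark of the sequential and random reader rates measured in the tests in
this mail) shows how byte-based fairness lets a seeky reader consume far
more disk time than a sequential one:

```python
# Byte-fair vs time-fair: give each reader the same number of bytes and
# compare the disk time consumed.  Rates are illustrative assumptions.
seq_kib_s, seeky_kib_s = 23000, 500   # rough per-reader rates on this disk
quota_kib = 10 * 1024                 # give each reader 10 MiB ("fair" by size)

seq_time = quota_kib / seq_kib_s      # ~0.45 s of disk time
seeky_time = quota_kib / seeky_kib_s  # ~20.5 s of disk time
print(round(seeky_time / seq_time))   # 46: the seeky reader's time multiple
```

So under size-based fairness the seeky application holds the disk roughly
46x longer for the same quota of IO, which is exactly the throughput
collapse described above.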

Currently dm-ioband provides fairness in terms of number/size of IO.

Latencies and isolation between groups
--------------------------------------
A higher-level controller generally implements a bandwidth throttling
solution: if a group exceeds either the max bandwidth or its proportional
share, throttle that group.

This kind of approach will probably not help in controlling latencies, as
that will depend on the underlying IO scheduler. Consider the following
scenario.

Assume there are two groups. One group is running multiple sequential
readers and the other group has a random reader. The sequential readers will
each get a nice 100ms slice and only then will the random reader from group2
get to dispatch a request. So the latency of this random reader depends on
how many sequential readers are running in the other group, which is weak
isolation between groups.

When we control things at the IO scheduler level, we assign one time slice
to a group and then pick the next entity to run. So effectively, after one
time slice (max 180ms, if a prio 0 sequential reader is running), the random
reader in the other group gets to run. Hence we achieve better isolation
between groups, as the response time of a process in a different group is
generally not dependent on the number of processes running in the competing
group.

So a higher-level solution is most likely limited to shaping bandwidth only,
without any control over latencies.

Stacking group scheduler on top of CFQ can lead to issues
---------------------------------------------------------
IO throttling and dm-ioband are both second-level controllers, that is, they
are implemented in layers higher than the IO schedulers. They control the IO
at the higher layer based on group policies, and the IO schedulers later
take care of dispatching these bios to disk.

Implementing a second-level controller has the advantage of being able to
provide bandwidth control even on logical block devices in the IO stack
which don't have any IO scheduler attached. But it can also interfere with
the IO scheduling policy of the underlying IO scheduler and change the
effective behavior. Following are some of the issues which I think should be
visible in a second-level controller in one form or another.

  Prio within a group
  -------------------
  A second-level controller can potentially interfere with the behavior of
  different prio processes within a group. bios are buffered at the higher
  layer in a single queue, and release of bios is FIFO, not proportionate to
  the ioprio of the process. This can result in a particular prio level not
  getting its fair share.

  Buffering at the higher layer can delay read requests for more than the
  slice idle period of CFQ (default 8 ms). That means it is possible that we
  are waiting for a request from the queue while it is buffered at the
  higher layer, and then the idle timer will fire. The queue will lose its
  share, and at the same time overall throughput will be impacted because we
  lost those 8 ms.
  
  Read Vs Write
  -------------
  Writes can overwhelm readers, so a second-level controller's FIFO release
  will run into issues here. If a single queue is maintained, then reads
  will suffer large latencies. If there are separate queues for reads and
  writes, then it will be hard to decide in what ratio to dispatch them, as
  it is the IO scheduler's decision when and how much read/write to
  dispatch. This is another place where the higher-level controller will not
  be in sync with the lower-level IO scheduler and can change the effective
  policies of the underlying IO scheduler.

  CFQ IO context Issues
  ---------------------
  Buffering at the higher layer means bios are submitted later with the help
  of a worker thread. This changes the io context information at the CFQ
  layer, which assigns the request to the submitting thread. The changed io
  context info again leads to idle timer expiry issues and to a process not
  getting its fair share, with reduced throughput.

  Throughput with noop, deadline and AS
  ---------------------------------------------
  I think a higher-level controller will result in reduced overall
  throughput (as compared to an IO scheduler based controller) and more
  seeks with noop, deadline and AS.

  The reason is that IO within a group is likely to be related and
  relatively close together, as compared to IO across groups; for example,
  the thread pool of kvm-qemu doing IO for a virtual machine. With
  higher-level control, IO from various groups goes into a single queue at
  the lower-level controller, and the IO may end up interleaved (G1, G2, G1,
  G3, G4....), causing more seeks and reduced throughput. (Agreed that
  merging will help to some extent, but still....)

  Instead, with a lower-level controller, the IO scheduler maintains one
  queue per group, so there is no interleaving of IO between groups. And if
  IO is related within a group, then we should get fewer seeks and higher
  throughput.

  Latency can be a concern but that can be controlled by reducing the time
  slice length of the queue.

Fairness at logical device level vs at physical device level
------------------------------------------------------------

IO scheduler based controller has the limitation that it works only with the
bottom most devices in the IO stack where IO scheduler is attached.

For example, assume a user has created a logical device lv0 using three
underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
in two groups doing IO on lv0. Also assume that weights of groups are in the
ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.

			     T1    T2
			       \   /
			        lv0
			      /  |  \
			    sda sdb  sdc


Now resource control will take place only on devices sda, sdb and sdc and
not at the lv0 level. So if IO from the two tasks is relatively uniformly
distributed across the disks, then T1 and T2 will see throughput in
proportion to the specified weights. But if IO from T1 and T2 goes to
different disks and there is no contention, then at the higher level they
will both see the same BW.

Here a second-level controller can produce better fairness numbers at the
logical device, but most likely at reduced overall throughput of the system,
because it will try to control IO even when there is no contention at the
physical device, possibly leaving disks in the system unused.

Hence, the question becomes how important it is to control bandwidth at
higher-level logical devices as well. The actual contention for resources is
at the leaf block device, so it probably makes sense to do any kind of
control there and not at the intermediate devices. It probably also means
better use of available resources.

Limited Fairness
----------------
Currently CFQ idles on a sequential reader queue to make sure it gets its
fair share. A second level controller will find it tricky to anticipate.
Either it has no anticipation logic, in which case it will not provide
fairness to a single reader in a group (as dm-ioband does), or if it starts
anticipating then we run into strange situations where the second level
controller is anticipating on one queue/group while the underlying IO
scheduler is anticipating on something else.
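A toy model of why anticipation (idling) pays off for a lone sequential
reader. All numbers here are illustrative assumptions, not measurements: a
reader that issues its next request 1ms after a completion, a 2ms idle
window, and an 8ms seek cost to switch to some other queue's far-away IO.

```python
# Toy model of the idling trade-off (illustrative numbers only).
think_time_ms = 1.0   # gap between a completion and the reader's next request
idle_window_ms = 2.0  # how long the scheduler waits for a nearby request
seek_cost_ms = 8.0    # cost of switching to another queue's far-away IO

def service_gap(idling):
    # With idling: if the next request arrives within the idle window,
    # the disk only sits unused for the think time and the reader keeps
    # its slice.  Without idling: the scheduler switches away immediately
    # and the reader pays a seek to get the head back.
    if idling and think_time_ms <= idle_window_ms:
        return think_time_ms
    return seek_cost_ms

print(service_gap(idling=True))   # -> 1.0
print(service_gap(idling=False))  # -> 8.0
```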

Need of device mapper tools
---------------------------
A device mapper based solution requires creating an ioband device on top of
each physical/logical device one wants to control. So it requires device
mapper tools even for people who are not otherwise using device mapper. At
the same time, creating an ioband device on each partition in the system to
control IO can be cumbersome and overwhelming if the system has lots of
disks and partitions.


IMHO, an IO scheduler based IO controller is a reasonable approach to
solving the problem of group bandwidth control, and can do hierarchical IO
scheduling more tightly and efficiently.

But I am all ears for alternative approaches and suggestions on how things
can be done better, and will be glad to implement them.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Testing
=======

Environment
==========
A 7200 RPM SATA drive with a queue depth of 31 and an ext3 filesystem. I
mostly ran fio jobs limited to 30-second runs and then monitored the
throughput and latency.
 
Test1: Random Reader Vs Random Writers
======================================
Launched a random reader and then an increasing number of random writers to
see the effect on the random reader's BW and max latencies.

[fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]

[Vanilla CFQ, No groups]
<--------------random writers-------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec   
2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec   
4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec   
8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec   
16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec   
32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec   

Created two cgroups group1 and group2 of weights 500 each.  Launched increasing
number of random writers in group1 and one random reader in group2 using fio.
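For reference, a setup like this can be sketched roughly as below. The
cgroup mount point and the "io" controller / "io.weight" file names are
assumptions for the sketch based on the description in this posting, not
verified against the actual patches; the fio parameters mirror the bracketed
command lines above (with an added --name, which fio requires on the
command line).

```shell
# Hypothetical sketch of the group setup (controller and file names assumed).
mount -t cgroup -o io none /cgroup
mkdir /cgroup/group1 /cgroup/group2
echo 500 > /cgroup/group1/io.weight
echo 500 > /cgroup/group2/io.weight

# Run the random writers in group1 and the random reader in group2.
(echo $$ > /cgroup/group1/tasks
 exec fio --name=writers --rw=randwrite --bs=64K --size=2G --runtime=30 \
      --direct=1 --ioengine=libaio --iodepth=4 --numjobs=8) &
(echo $$ > /cgroup/group2/tasks
 exec fio --name=reader --rw=randread --bs=4K --size=2G --runtime=30 \
      --direct=1) &
wait
```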

[IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
<--------------random writers(group1)-------------> <-random reader(group2)->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec   
2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec   
4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec   
8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec   
16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec   
32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec   

Also ran the same test with IO controller CFQ in flat mode to see if there
are any major deviations from Vanilla CFQ. Does not look like any.

[IO controller CFQ; No groups ]
<--------------random writers-------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec   
2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec   
4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec   
8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec   
16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec   
32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec   

Notes:
- With vanilla CFQ, random writers can overwhelm a random reader, bringing
  down its throughput and bumping up latencies significantly.

- With the IO controller, one can provide isolation to the random reader
  group and maintain a consistent view of bandwidth and latencies.

Test2: Random Reader Vs Sequential Reader
=========================================
Launched a random reader and then an increasing number of sequential readers
to see the effect on the random reader's BW and latencies.

[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]

[ Vanilla CFQ, No groups ]
<---------------seq readers---------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec   
2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec   
4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec   
8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec   
16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec  

Created two cgroups group1 and group2 of weights 500 each.  Launched increasing
number of sequential readers in group1 and one random reader in group2 using
fio.

[IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
<---------------group1--------------------------->  <------group2--------->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec   
2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec   
4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec   
8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec   
16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec   

Also ran the same test with IO controller CFQ in flat mode to see if there
are any major deviations from Vanilla CFQ. Does not look like any.

[IO controller CFQ; No groups ]
<---------------seq readers---------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec   
2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec   
4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec   
8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec   
16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec  

Notes:
- The BW and latencies of the random reader in group2 are stable and
  bounded and do not get impacted much as the number of sequential readers
  in group1 increases, hence providing good isolation.

- Throughput of the sequential readers comes down and latencies go up, as
  half of the disk bandwidth (in terms of time) has been reserved for the
  random reader group.

Test3: Sequential Reader Vs Sequential Reader
=============================================
Created two cgroups group1 and group2 with weights 500 and 1000 respectively.
Launched an increasing number of sequential readers in group1 and one
sequential reader in group2 using fio, and monitored how bandwidth is
distributed between the two groups.

First 5 columns give stats about job in group1 and last two columns give
stats about job in group2.

<---------------group1--------------------------->  <------group2--------->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency 
1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec   
2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec   
4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec   
8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec   
16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec   

Note: group2 gets double the bandwidth of group1 even in the face of an
increasing number of readers in group1.

Test4 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on the host into two
partitions and gave one partition to each virtual machine. Put the two
virtual machines in two different cgroups with weights 1000 and 500
respectively. The virtual machines created an ext3 file system on the
partitions exported from the host and did buffered writes. The host sees
the writes as synchronous, and the virtual machine with the higher weight
gets double the disk time of the virtual machine with the lower weight.
Used the deadline scheduler in this test case.

Some more details about configuration are in documentation patch.

Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (page cache), and possibly in the file
system layer as well (btrfs, xfs etc.), and are dispatched to the lower
layers not necessarily in a proportional manner.

For example, consider two dd threads reading /dev/zero as the input file and
writing huge files. Very soon we will cross vm_dirty_ratio and a dd thread
will be forced to write out some pages to disk before more pages can be
dirtied. But it is not necessarily dirty pages of the same thread that get
picked: writeback can very well pick the inode of the lower priority dd
thread and do some writeout there. So effectively the higher weight dd is
doing writeouts of the lower weight dd's pages, and we don't see service
differentiation.
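As an illustration, the two-writer scenario can be reproduced along these
lines. The target paths and the cgroup layout are assumptions for the
sketch, not taken from the posting:

```shell
# Two buffered writers of huge files in groups of different weight
# (cgroup paths and mount point are hypothetical).  Once dirty pages
# cross vm.dirty_ratio, writeback may pick either thread's inode.
(echo $$ > /cgroup/group1/tasks   # higher weight group
 exec dd if=/dev/zero of=/mnt/big1 bs=1M count=4096) &
(echo $$ > /cgroup/group2/tasks   # lower weight group
 exec dd if=/dev/zero of=/mnt/big2 bs=1M count=4096) &
wait
```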

IOW, the core problem with buffered write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep its
queue continuously backlogged. In my testing, there are many 0.2 to 0.8
second intervals where the higher weight queue is empty, and in that
duration the lower weight queue gets lots of work done, giving the
impression that there was no service differentiation.

In summary, from the IO controller's point of view async write support is
there. But because the page cache has not been designed so that a higher
prio/weight writer can do more writeout than a lower prio/weight writer,
getting service differentiation is hard, and it is visible in some cases
and not in others.

Vanilla CFQ Vs IO Controller CFQ
================================
We have not fundamentally changed CFQ; instead we enhanced it to also
support hierarchical IO scheduling. In the process there are invariably
small changes here and there as new scenarios come up. Ran some tests and
compared both CFQs to see if there is any major deviation in behavior.

Test1: Sequential Readers
=========================
[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec  
2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec   
4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec   
8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec   
16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec  

IO scheduler: IO controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec  
2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec   
4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec   
8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec   
16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec  

Test2: Sequential Writers
=========================
[fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec   
2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec   
4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec   
8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec  
16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec  

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec  
2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec   
4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec   
8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec   
16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec  

Test3: Random Readers
=========================
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   484KiB/s    484KiB/s    484KiB/s    22596 usec  
2   229KiB/s    196KiB/s    425KiB/s    51111 usec  
4   119KiB/s    73KiB/s     405KiB/s    2344 msec   
8   93KiB/s     23KiB/s     399KiB/s    2246 msec   
16  38KiB/s     8KiB/s      328KiB/s    3965 msec   

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   483KiB/s    483KiB/s    483KiB/s    29391 usec  
2   229KiB/s    196KiB/s    426KiB/s    51625 usec  
4   132KiB/s    88KiB/s     417KiB/s    2313 msec   
8   79KiB/s     18KiB/s     389KiB/s    2298 msec   
16  43KiB/s     9KiB/s      327KiB/s    3905 msec   

Test4: Random Writers
=====================
[fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec  
2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec  
4   2657KiB/s   265KiB/s    6025KiB/s   216K usec   
8   951KiB/s    122KiB/s    3386KiB/s   1148K usec  
16  66KiB/s     22KiB/s     829KiB/s    1308 msec   

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency 
1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec  
2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec   
4   3113KiB/s   334KiB/s    5782KiB/s   200K usec   
8   1146KiB/s   95KiB/s     3832KiB/s   593K usec   
16  71KiB/s     29KiB/s     814KiB/s    1457 msec   

Notes:
 - Does not look like anything has changed significantly.

Previous versions of the patches were posted here.
--------------------------------------------------

(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279
(V6) http://lkml.org/lkml/2009/7/2/369
(V7) http://lkml.org/lkml/2009/7/24/253
(V8) http://lkml.org/lkml/2009/8/16/204
(V9) http://lkml.org/lkml/2009/8/28/327

Thanks
Vivek

2009-10-03  7:25                                                         ` Jens Axboe
2009-10-03  7:25                                                           ` Jens Axboe
2009-10-03  8:53                                                           ` Mike Galbraith
     [not found]                                                           ` <20091003072540.GW31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2009-10-03  8:53                                                             ` Mike Galbraith
2009-10-03  9:01                                                             ` Corrado Zoccolo
2009-10-03  9:01                                                           ` Corrado Zoccolo
     [not found]                                                     ` <20091002181903.GN31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2009-10-02 18:57                                                       ` Mike Galbraith
2009-10-03  5:48                                                       ` Mike Galbraith
     [not found]                                                   ` <1254507215.8667.7.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
2009-10-02 18:19                                                     ` Jens Axboe
     [not found]                                             ` <20091002172554.GJ31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2009-10-02 17:28                                               ` Ingo Molnar
     [not found]                                           ` <20091002172046.GA2376-X9Un+BFzKDI@public.gmane.org>
2009-10-02 17:25                                             ` Jens Axboe
     [not found]                                         ` <20091002171129.GG31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2009-10-02 17:20                                           ` Ingo Molnar
     [not found]                                       ` <alpine.LFD.2.01.0910020811490.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2009-10-02 16:01                                         ` jim owens
2009-10-02 17:11                                         ` Jens Axboe
2009-10-02 16:33                                     ` Ray Lee
2009-10-02 17:13                                       ` Jens Axboe
2009-10-02 17:13                                         ` Jens Axboe
     [not found]                                       ` <2c0942db0910020933l6d312c6ahae0e00619f598b39-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-10-02 17:13                                         ` Jens Axboe
     [not found]                                     ` <20091002145610.GD31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2009-10-02 15:14                                       ` Linus Torvalds
2009-10-02 16:33                                       ` Ray Lee
     [not found]                                   ` <alpine.LFD.2.01.0910020715160.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2009-10-02 14:45                                     ` Mike Galbraith
2009-10-02 14:56                                     ` Jens Axboe
2009-10-02 16:22                                     ` Ingo Molnar
2009-10-02 16:22                                   ` Ingo Molnar
2009-10-02 16:22                                     ` Ingo Molnar
     [not found]                             ` <20091002092409.GA19529-X9Un+BFzKDI@public.gmane.org>
2009-10-02  9:28                               ` Jens Axboe
2009-10-02  9:36                               ` Mike Galbraith
2009-10-02  9:36                             ` Mike Galbraith
2009-10-02 16:37                               ` Ingo Molnar
2009-10-02 16:37                                 ` Ingo Molnar
     [not found]                               ` <1254476214.11022.8.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
2009-10-02 16:37                                 ` Ingo Molnar
     [not found]                       ` <20091001185816.GU14918-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2009-10-02  6:23                         ` Mike Galbraith
2009-10-02 18:08                     ` Jens Axboe
2009-10-02 18:08                   ` Jens Axboe
2009-10-02 18:29                     ` Mike Galbraith
2009-10-02 18:36                       ` Jens Axboe
     [not found]                       ` <1254508197.8667.22.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
2009-10-02 18:36                         ` Jens Axboe
     [not found]                     ` <20091002180857.GM31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
2009-10-02 18:29                       ` Mike Galbraith
     [not found]               ` <1254341139.7695.36.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
2009-09-30 20:24                 ` Vivek Goyal
2009-09-27 17:00     ` Corrado Zoccolo
2009-09-28 14:56       ` Vivek Goyal
2009-09-28 14:56         ` Vivek Goyal
     [not found]         ` <20090928145655.GB8192-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-09-28 15:35           ` Corrado Zoccolo
2009-09-28 15:35         ` Corrado Zoccolo
2009-09-28 17:14           ` Vivek Goyal
2009-09-28 17:14             ` Vivek Goyal
2009-09-29  7:10             ` Corrado Zoccolo
     [not found]             ` <20090928171420.GA3643-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-09-29  7:10               ` Corrado Zoccolo
2009-09-28 17:51           ` Mike Galbraith
2009-09-28 18:18             ` Vivek Goyal
2009-09-28 18:18               ` Vivek Goyal
2009-09-28 18:53               ` Mike Galbraith
2009-09-29  7:14                 ` Corrado Zoccolo
     [not found]                 ` <1254164034.9820.81.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
2009-09-29  7:14                   ` Corrado Zoccolo
     [not found]               ` <20090928181846.GC3643-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-09-28 18:53                 ` Mike Galbraith
2009-09-29  5:55             ` Mike Galbraith
     [not found]             ` <1254160274.9820.25.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
2009-09-28 18:18               ` Vivek Goyal
2009-09-29  5:55               ` Mike Galbraith
     [not found]           ` <4e5e476b0909280835w3410d58aod93a29d1dcda8909-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-09-28 17:14             ` Vivek Goyal
2009-09-28 17:51             ` Mike Galbraith
     [not found]       ` <4e5e476b0909271000u69d79346s27cccad219e49902-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-09-28 14:56         ` Vivek Goyal
     [not found]     ` <20090925202636.GC15007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-09-26 14:51       ` Mike Galbraith
2009-09-27 17:00       ` Corrado Zoccolo
2009-09-29  0:37 ` Nauman Rafique
2009-09-29  0:37   ` Nauman Rafique
2009-09-29  3:22   ` Vivek Goyal
2009-09-29  3:22     ` Vivek Goyal
     [not found]     ` <20090929032255.GA10664-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-09-29  9:56       ` Ryo Tsuruta
2009-09-29  9:56     ` Ryo Tsuruta
2009-09-29 10:49       ` Takuya Yoshikawa
2009-09-29 14:10       ` Vivek Goyal
2009-09-29 14:10         ` Vivek Goyal
2009-09-29 19:53         ` Nauman Rafique
     [not found]         ` <20090929141049.GA12141-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-09-29 19:53           ` Nauman Rafique
2009-09-30  8:43           ` Ryo Tsuruta
2009-09-30  8:43         ` Ryo Tsuruta
2009-09-30 11:05           ` Vivek Goyal
2009-09-30 11:05             ` Vivek Goyal
2009-10-01  6:41             ` Ryo Tsuruta
2009-10-01  6:41               ` Ryo Tsuruta
     [not found]               ` <20091001.154125.104044685.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-10-01 13:31                 ` Vivek Goyal
2009-10-01 13:31               ` Vivek Goyal
2009-10-01 13:31                 ` Vivek Goyal
2009-10-02  2:57                 ` Vivek Goyal
2009-10-02  2:57                   ` Vivek Goyal
2009-10-02 20:27                   ` Munehiro Ikeda
2009-10-02 20:27                     ` Munehiro Ikeda
     [not found]                     ` <4AC6623F.70600-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>
2009-10-05 10:38                       ` Ryo Tsuruta
2009-10-05 10:38                     ` Ryo Tsuruta
2009-10-05 10:38                       ` Ryo Tsuruta
     [not found]                       ` <20091005.193808.104033719.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-10-05 12:31                         ` Vivek Goyal
2009-10-05 12:31                       ` Vivek Goyal
2009-10-05 12:31                         ` Vivek Goyal
2009-10-05 14:55                         ` Ryo Tsuruta
2009-10-05 14:55                           ` Ryo Tsuruta
2009-10-05 17:10                           ` Vivek Goyal
2009-10-05 17:10                             ` Vivek Goyal
2009-10-05 18:11                             ` Nauman Rafique
2009-10-05 18:11                               ` Nauman Rafique
     [not found]                               ` <e98e18940910051111r110dc776l5105bf931761b842-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-10-06  7:17                                 ` Ryo Tsuruta
2009-10-06  7:17                               ` Ryo Tsuruta
2009-10-06  7:17                                 ` Ryo Tsuruta
2009-10-06 11:22                                 ` Vivek Goyal
2009-10-06 11:22                                   ` Vivek Goyal
2009-10-07 14:38                                   ` Ryo Tsuruta
2009-10-07 14:38                                     ` Ryo Tsuruta
2009-10-07 15:09                                     ` Vivek Goyal
2009-10-07 15:09                                       ` Vivek Goyal
2009-10-08  2:18                                       ` Ryo Tsuruta
2009-10-08  2:18                                         ` Ryo Tsuruta
     [not found]                                       ` <20091007150929.GB3674-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-10-08  2:18                                         ` Ryo Tsuruta
2009-10-07 16:41                                     ` Rik van Riel
2009-10-07 16:41                                       ` Rik van Riel
     [not found]                                       ` <4ACCC4B7.4050805-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-10-08 10:22                                         ` Ryo Tsuruta
2009-10-08 10:22                                       ` Ryo Tsuruta
2009-10-08 10:22                                         ` Ryo Tsuruta
     [not found]                                     ` <20091007.233805.183040347.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-10-07 15:09                                       ` Vivek Goyal
2009-10-07 16:41                                       ` Rik van Riel
     [not found]                                   ` <20091006112201.GA27866-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-10-07 14:38                                     ` Ryo Tsuruta
     [not found]                                 ` <20091006.161744.189719641.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-10-06 11:22                                   ` Vivek Goyal
     [not found]                             ` <20091005171023.GG22143-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-10-05 18:11                               ` Nauman Rafique
     [not found]                           ` <20091005.235535.193690928.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-10-05 17:10                             ` Vivek Goyal
     [not found]                         ` <20091005123148.GB22143-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-10-05 14:55                           ` Ryo Tsuruta
     [not found]                   ` <20091002025731.GA2738-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-10-02 20:27                     ` Munehiro Ikeda
     [not found]                 ` <20091001133109.GA4058-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-10-02  2:57                   ` Vivek Goyal
     [not found]             ` <20090930110500.GA26631-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-10-01  6:41               ` Ryo Tsuruta
     [not found]           ` <20090930.174319.183036386.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-09-30 11:05             ` Vivek Goyal
     [not found]       ` <20090929.185653.183056711.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
2009-09-29 10:49         ` Takuya Yoshikawa
2009-09-29 14:10         ` Vivek Goyal
2009-09-30  3:11         ` Vivek Goyal
2009-09-30  3:11       ` Vivek Goyal
2009-09-30  3:11         ` Vivek Goyal
     [not found]   ` <e98e18940909281737q142c788dpd20b8bdc05dd0eff-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-09-29  3:22     ` Vivek Goyal
     [not found] ` <1253820332-10246-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2009-09-24 21:33   ` Andrew Morton
2009-09-25  2:20   ` Ulrich Lukas
2009-09-29  0:37   ` Nauman Rafique
2009-09-24 19:25 Vivek Goyal
