* Re: IO scheduler based IO controller V10
From: Corrado Zoccolo @ 2009-10-02 10:55 UTC
To: Jens Axboe
Cc: Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

Hi Jens,
On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> On Fri, Oct 02 2009, Ingo Molnar wrote:
>>
>> * Jens Axboe <jens.axboe@oracle.com> wrote:
>>
>
> It's really not that simple, if we go and do easy latency bits, then
> throughput drops 30% or more. You can't say it's black and white latency
> vs throughput issue, that's just not how the real world works. The
> server folks would be most unpleased.

Could we be more selective when the latency optimization is introduced?

The code that is currently touched by Vivek's patch is:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
	    (cfqd->hw_tag && CIC_SEEKY(cic)))
		enable_idle = 0;

Basically, when fairness=1, it becomes just:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
		enable_idle = 0;

Note that, even if we enable idling here, cfq_arm_slice_timer will use
a different idle window for seeky I/O (2ms) than for normal I/O.

I think that the 2ms idle window is good for a single rotational SATA
disk scenario, even if it supports NCQ. Realistic access times for
those disks are still around 8ms (but it is proportional to seek
length), and waiting 2ms to see if we get a nearby request may pay
off, not only in latency and fairness, but also in throughput.

What we don't want to do is to enable idling for NCQ-enabled SSDs
(and this is already taken care of in cfq_arm_slice_timer) or for
hardware RAIDs.
If we agree that hardware RAIDs should be marked as non-rotational, then
that code could become:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
	    (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
		enable_idle = 0;
	else if (sample_valid(cic->ttime_samples)) {
		unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
		if (cic->ttime_mean > idle_time)
			enable_idle = 0;
		else
			enable_idle = 1;
	}

Thanks,
Corrado

>
> --
> Jens Axboe
>

--
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
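Editorial sketch: the decision proposed in the message above can be modelled in userspace to see which inputs flip the result. This is only an illustration under assumptions. The struct and function names are hypothetical, the fields stand in for the kernel expressions named in the comments, and the fall-through to the previous idle setting is assumed from the surrounding function rather than shown in the message.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins for the kernel state read by the
 * proposed condition; these names do not exist in the kernel. */
struct idle_inputs {
	int nr_tasks;         /* models atomic_read(&cic->ioc->nr_tasks) */
	unsigned slice_idle;  /* models cfqd->cfq_slice_idle */
	bool nonrot;          /* models blk_queue_nonrot(cfqd->queue) */
	bool hw_tag;          /* models cfqd->hw_tag */
	bool seeky;           /* models CIC_SEEKY(cic) */
	bool samples_valid;   /* models sample_valid(cic->ttime_samples) */
	unsigned ttime_mean;  /* models cic->ttime_mean, same unit as slice_idle */
	unsigned min_tt;      /* models CFQ_MIN_TT, same unit as slice_idle */
};

/* Returns the new idle-window setting. old_idle is the previous setting,
 * kept when neither branch applies (an assumption about the caller). */
bool proposed_enable_idle(const struct idle_inputs *in, bool old_idle)
{
	/* Never idle for dead contexts, when idling is switched off, or for
	 * seeky queues on non-rotational devices with command queuing. */
	if (!in->nr_tasks || !in->slice_idle ||
	    (in->nonrot && in->hw_tag && in->seeky))
		return false;

	if (in->samples_valid) {
		/* Seeky queues get the shorter window. */
		unsigned idle_time = in->seeky ? in->min_tt : in->slice_idle;
		return in->ttime_mean <= idle_time;
	}
	return old_idle;
}
```

With these inputs, a seeky queue on a rotational NCQ disk keeps idling as long as its mean think time fits in the short window, while the same queue on a non-rotational device never idles.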
* Re: IO scheduler based IO controller V10
From: Jens Axboe @ 2009-10-02 11:04 UTC
To: Corrado Zoccolo
Cc: Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Fri, Oct 02 2009, Corrado Zoccolo wrote:
> Hi Jens,
> On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> >>
> >> * Jens Axboe <jens.axboe@oracle.com> wrote:
> >>
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more. You can't say it's black and white latency
> > vs throughput issue, that's just not how the real world works. The
> > server folks would be most unpleased.
> Could we be more selective when the latency optimization is introduced?
>
> The code that is currently touched by Vivek's patch is:
>	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>	    (cfqd->hw_tag && CIC_SEEKY(cic)))
>		enable_idle = 0;
> basically, when fairness=1, it becomes just:
>	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
>		enable_idle = 0;
>
> Note that, even if we enable idling here, the cfq_arm_slice_timer will use
> a different idle window for seeky (2ms) than for normal I/O.
>
> I think that the 2ms idle window is good for a single rotational SATA
> disk scenario, even if it supports NCQ. Realistic access times for
> those disks are still around 8ms (but it is proportional to seek
> length), and waiting 2ms to see if we get a nearby request may pay
> off, not only in latency and fairness, but also in throughput.

I agree, that change looks good.

> What we don't want to do is to enable idling for NCQ enabled SSDs
> (and this is already taken care in cfq_arm_slice_timer) or for hardware RAIDs.

Right, it was part of the bigger SSD optimization stuff I did a few
revisions back.

> If we agree that hardware RAIDs should be marked as non-rotational, then that
> code could become:
>
>	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>	    (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
>		enable_idle = 0;
>	else if (sample_valid(cic->ttime_samples)) {
>		unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
>		if (cic->ttime_mean > idle_time)
>			enable_idle = 0;
>		else
>			enable_idle = 1;
>	}

Yes, agree on that too. We probably should make a different flag for
hardware RAIDs, telling the IO scheduler that this device is really
composed of several others. If it's composed only of SSDs (or has a
frontend similar to that), then non-rotational applies.

But yes, we should pass that information down.

--
Jens Axboe
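Editorial sketch: the "wait 2ms to avoid an 8ms access" argument that Jens agrees with above can be turned into a rough break-even calculation. This is a deliberately crude model, not something taken from the thread. The 8ms and 2ms figures come from Corrado's message, but the cost of serving a nearby request (`near_us`) and the probability that one arrives in time (`hit_pct`) are assumptions, and all names are hypothetical.

```c
#include <stdbool.h>

/* All times in microseconds. far_us and idle_us mirror the figures in the
 * message (about 8ms access time, 2ms idle window); near_us is an assumed
 * cost for a nearby request and does not come from the thread. */
struct idle_model {
	unsigned far_us;
	unsigned near_us;
	unsigned idle_us;
};

/* Expected cost of the next request when idling, scaled by 100 so that
 * hit_pct (0..100) can be a whole percentage. Pessimistically charges the
 * full idle window even when the nearby request arrives early. */
unsigned long expected_cost_idling_x100(const struct idle_model *m,
					unsigned hit_pct)
{
	return 100UL * m->idle_us
	     + (unsigned long)hit_pct * m->near_us
	     + (unsigned long)(100 - hit_pct) * m->far_us;
}

/* Idling wins if its expected cost is below one full-length access. */
bool idling_pays_off(const struct idle_model *m, unsigned hit_pct)
{
	return expected_cost_idling_x100(m, hit_pct) < 100UL * m->far_us;
}
```

Under this model with `far_us = 8000`, `idle_us = 2000` and an assumed `near_us = 1000`, idling starts to pay off once a nearby request arrives a little under 29% of the time. The real trade-off also depends on what the other queues would have done with the disk during the wait, which the model ignores.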
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-10-02 12:49 UTC
To: Corrado Zoccolo
Cc: Jens Axboe, Ingo Molnar, Mike Galbraith, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> Hi Jens,
> On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe@oracle.com> wrote:
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> >>
> >> * Jens Axboe <jens.axboe@oracle.com> wrote:
> >>
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more. You can't say it's black and white latency
> > vs throughput issue, that's just not how the real world works. The
> > server folks would be most unpleased.
> Could we be more selective when the latency optimization is introduced?
>
> The code that is currently touched by Vivek's patch is:
>	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>	    (cfqd->hw_tag && CIC_SEEKY(cic)))
>		enable_idle = 0;
> basically, when fairness=1, it becomes just:
>	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle)
>		enable_idle = 0;
>

Actually I am not touching this code. Looking at the V10, I have not
changed anything here in the idling code.

I think we are seeing latency improvements with fairness=1 because CFQ
does pure round-robin, and once a seeky reader expires, it is put at the
end of the queue.

I retained the same behavior if fairness=0, but if fairness=1, then I
don't put the seeky reader at the end of the queue; instead it gets a
vdisktime based on the disk time it has used. So it should get placed
ahead of sync readers.

I think the following is the code snippet in "elevator-fq.c" which is
making a difference.

	/*
	 * We don't want to charge more than allocated slice otherwise this
	 * queue can miss one dispatch round doubling max latencies. On the
	 * other hand we don't want to charge less than allocated slice as
	 * we stick to CFQ theme of queue losing its share if it does not
	 * use the slice and moves to the back of service tree (almost).
	 */
	if (!ioq->efqd->fairness)
		queue_charge = allocated_slice;

So if a sync reader consumes 100ms and a seeky reader dispatches only
one request, then in CFQ, the seeky reader gets to dispatch its next
request after another 100ms.

With fairness=1, it should get a lower vdisktime when it comes with a new
request because its last slice usage was less (like CFS sleepers, as Mike
said). But this will make a difference only if there is more than one
process in the system; otherwise a vtime jump will take place by the time
the seeky reader gets backlogged.

Anyway, once I started timestamping the queues and started keeping a cache
of expired queues, any queue which got a new request almost immediately
should get a lower vdisktime assigned if it did not use the full time
slice in the previous dispatch round. Hence with fairness=1, seeky readers
kind of get more share of the disk (fair share), because these are now
placed ahead of streaming readers, and hence get better latencies.

In short, most likely, better latencies are being experienced because the
seeky reader is getting a lower time stamp (vdisktime), because it did not
use its full time slice in the previous dispatch round, and not because we
kept idling enabled on the seeky reader.

Thanks
Vivek

> Note that, even if we enable idling here, the cfq_arm_slice_timer will use
> a different idle window for seeky (2ms) than for normal I/O.
>
> I think that the 2ms idle window is good for a single rotational SATA disk scenario,
> even if it supports NCQ. Realistic access times for those disks are still around 8ms
> (but it is proportional to seek length), and waiting 2ms to see if we get a nearby
> request may pay off, not only in latency and fairness, but also in throughput.
>
> What we don't want to do is to enable idling for NCQ enabled SSDs
> (and this is already taken care in cfq_arm_slice_timer) or for hardware RAIDs.
> If we agree that hardware RAIDs should be marked as non-rotational, then that
> code could become:
>
>	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
>	    (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
>		enable_idle = 0;
>	else if (sample_valid(cic->ttime_samples)) {
>		unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
>		if (cic->ttime_mean > idle_time)
>			enable_idle = 0;
>		else
>			enable_idle = 1;
>	}
>
> Thanks,
> Corrado
>
> >
> > --
> > Jens Axboe
> >
>
> --
> __________________________________________________________________________
>
> dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------
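Editorial sketch: the charging rule Vivek describes can be illustrated with a tiny userspace model. It is a simplification under stated assumptions, not the elevator-fq.c implementation. The function names are hypothetical, group and queue weights are ignored, and the clamp to the allocated slice is inferred from the quoted comment rather than from code shown in the thread.

```c
#include <stdbool.h>

/* Hypothetical model of the charge decision. Times are in milliseconds.
 * With fairness off, a queue is always charged its whole allocated slice;
 * with fairness on, it is charged what it actually used, never more than
 * the allocated slice (assumed from the quoted comment). */
unsigned model_queue_charge(bool fairness, unsigned used_slice,
			    unsigned allocated_slice)
{
	unsigned charge = used_slice;

	if (charge > allocated_slice)
		charge = allocated_slice;
	if (!fairness)
		charge = allocated_slice;
	return charge;
}

/* A queue's position in the service tree advances by its charge; a lower
 * vdisktime means the queue is served sooner. */
unsigned long model_next_vdisktime(unsigned long vdisktime, bool fairness,
				   unsigned used_slice,
				   unsigned allocated_slice)
{
	return vdisktime + model_queue_charge(fairness, used_slice,
					      allocated_slice);
}
```

Taking the 100ms slice from the message and assuming the seeky reader's single request used 8ms: with fairness=0 both readers advance by 100 and the seeky reader waits a full round, while with fairness=1 the seeky reader advances by only 8 and sorts ahead of the streaming reader.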
* Re: IO scheduler based IO controller V10
From: Corrado Zoccolo @ 2009-10-02 15:27 UTC
To: Vivek Goyal, Mike Galbraith
Cc: Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
>
> Actually I am not touching this code. Looking at the V10, I have not
> changed anything here in idling code.

I based my analysis on the original patch:
http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html

Mike, can you confirm which version of the fairness patch you used
in your tests?

Corrado

> Thanks
> Vivek
>
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-10-02 15:31 UTC
To: Corrado Zoccolo
Cc: Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Fri, Oct 02, 2009 at 05:27:55PM +0200, Corrado Zoccolo wrote:
> On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> >
> > Actually I am not touching this code. Looking at the V10, I have not
> > changed anything here in idling code.
>
> I based my analysis on the original patch:
> http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
>

Oh, you are talking about the "fairness for seeky processes" patch. I
thought you were talking about the current IO controller patches.
Actually they both have this notion of a "fairness=1" parameter but do
different things, hence the confusion.

Thanks
Vivek

> Mike, can you confirm which version of the fairness patch did you use
> in your tests?
>
> Corrado
>
> > Thanks
> > Vivek
> >
* Re: IO scheduler based IO controller V10
From: Mike Galbraith @ 2009-10-02 15:32 UTC
To: Corrado Zoccolo
Cc: dhaval, dm-devel, Jens Axboe, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel, fchecconi, containers, linux-kernel, akpm, righi.andrea, torvalds

On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote:
> On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> >
> > Actually I am not touching this code. Looking at the V10, I have not
> > changed anything here in idling code.
>
> I based my analysis on the original patch:
> http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
>
> Mike, can you confirm which version of the fairness patch you used
> in your tests?

That would be this one-liner.

o CFQ provides fair access to disk, in terms of disk time used, to processes.
  Fairness is provided for the applications which have their think time
  within the slice_idle (8ms default) limit.

o CFQ currently disables idling for seeky processes. So even if a process
  has think time within the slice_idle limit, it will still not get a fair
  share of the disk. Disabling idling for a seeky process seems good from a
  throughput perspective but not necessarily from a fairness perspective.

o Do not disable idling based on the seek pattern of a process if the user
  has set /sys/block/<disk>/queue/iosched/fairness = 1.

Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
---
 block/cfq-iosched.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -1953,7 +1953,7 @@ cfq_update_idle_window(struct cfq_data *
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
+	    (!cfqd->cfq_fairness && cfqd->hw_tag && CIC_SEEKY(cic)))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
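To make the effect of the one-liner easier to see, here is a minimal userspace sketch of the decision it changes. It is not kernel code: the struct and function names (model_cfqd, model_cic, model_update_idle_window) are hypothetical stand-ins, and only the fields that the patched condition reads are modelled. The final "idle only if mean think time fits in slice_idle" step follows the snippet Corrado quoted earlier in the thread.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel's cfq_data and
 * cfq_io_context; only the fields read by the patched condition. */
struct model_cfqd {
    unsigned int slice_idle; /* models cfq_slice_idle; 0 disables idling */
    bool hw_tag;             /* device appears to queue several commands */
    bool fairness;           /* models the new iosched/fairness tunable */
};

struct model_cic {
    int nr_tasks;            /* tasks still attached to the io context */
    bool seeky;              /* models the result of CIC_SEEKY(cic) */
    bool ttime_valid;        /* models sample_valid(cic->ttime_samples) */
    unsigned int ttime_mean; /* mean think time, same unit as slice_idle */
};

/* Returns the new enable_idle value. old_idle is kept when neither branch
 * applies, mirroring "enable_idle = old_idle = ..." in the patch context. */
bool model_update_idle_window(const struct model_cfqd *cfqd,
                              const struct model_cic *cic, bool old_idle)
{
    if (cic->nr_tasks == 0 || cfqd->slice_idle == 0 ||
        (!cfqd->fairness && cfqd->hw_tag && cic->seeky))
        return false; /* with fairness=1 a seeky queue no longer lands here */

    if (cic->ttime_valid)
        return cic->ttime_mean <= cfqd->slice_idle;

    return old_idle;
}
```

With hw_tag set and a seeky queue whose think time is short, the function returns false while fairness is 0 and true once fairness is 1, which is the whole behavioural change of the patch.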
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-10-02 15:40 UTC
To: Mike Galbraith
Cc: Corrado Zoccolo, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Fri, Oct 02, 2009 at 05:32:00PM +0200, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote:
> > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> > >
> > > Actually I am not touching this code. Looking at the V10, I have not
> > > changed anything here in idling code.
> >
> > I based my analysis on the original patch:
> > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
> >
> > Mike, can you confirm which version of the fairness patch you used
> > in your tests?
>
> That would be this one-liner.
>

Ok. Thanks. Sorry, I got confused and thought that you were using the "io
controller patches" with fairness=1.

In that case, Corrado's suggestion of refining it further and disabling idling
for seeky processes only on non-rotational media (SSD and hardware RAID) makes
sense to me.

Thanks
Vivek

> o CFQ provides fair access to disk, in terms of disk time used, to processes.
>   Fairness is provided for the applications which have their think time
>   within the slice_idle (8ms default) limit.
>
> o CFQ currently disables idling for seeky processes. So even if a process
>   has think time within the slice_idle limit, it will still not get a fair
>   share of the disk. Disabling idling for a seeky process seems good from a
>   throughput perspective but not necessarily from a fairness perspective.
>
> o Do not disable idling based on the seek pattern of a process if the user
>   has set /sys/block/<disk>/queue/iosched/fairness = 1.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> ---
>  block/cfq-iosched.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -1953,7 +1953,7 @@ cfq_update_idle_window(struct cfq_data *
>  	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
>  	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> -	    (cfqd->hw_tag && CIC_SEEKY(cic)))
> +	    (!cfqd->cfq_fairness && cfqd->hw_tag && CIC_SEEKY(cic)))
>  		enable_idle = 0;
>  	else if (sample_valid(cic->ttime_samples)) {
>  		if (cic->ttime_mean > cfqd->cfq_slice_idle)
* Re: IO scheduler based IO controller V10
From: Mike Galbraith @ 2009-10-02 16:03 UTC
To: Vivek Goyal
Cc: dhaval, Corrado Zoccolo, dm-devel, Jens Axboe, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel, fchecconi, containers, linux-kernel, akpm, righi.andrea, torvalds

On Fri, 2009-10-02 at 11:40 -0400, Vivek Goyal wrote:
> On Fri, Oct 02, 2009 at 05:32:00PM +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote:
> > > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> > > >
> > > > Actually I am not touching this code. Looking at the V10, I have not
> > > > changed anything here in idling code.
> > >
> > > I based my analysis on the original patch:
> > > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
> > >
> > > Mike, can you confirm which version of the fairness patch you used
> > > in your tests?
> >
> > That would be this one-liner.
> >
>
> Ok. Thanks. Sorry, I got confused and thought that you were using the "io
> controller patches" with fairness=1.
>
> In that case, Corrado's suggestion of refining it further and disabling idling
> for seeky processes only on non-rotational media (SSD and hardware RAID) makes
> sense to me.

One thing that might help with that is to have new tasks start out life
meeting the seeky criteria. If there's anything going on, they will be.

	-Mike
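Mike's suggestion can be pictured with a small userspace sketch. Everything here is an assumption made for illustration: the names, the 8 * 1024 sector cutoff and the 7/8 decaying mean are not taken from this thread, and the kernel's real seek statistics are more involved. The only point shown is that a freshly created context is classified as seeky until enough short seeks have been observed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed cutoff between "sequential" and "seeky", in sectors. */
#define MODEL_SEEKY_THRESHOLD (8u * 1024u)

/* Hypothetical per-context seek statistics. */
struct model_seek_stats {
    uint64_t seek_mean; /* decaying mean seek distance, in sectors */
};

/* Start pessimistic: a new context begins life above the threshold. */
void model_seek_stats_init(struct model_seek_stats *s)
{
    s->seek_mean = 2 * MODEL_SEEKY_THRESHOLD;
}

/* Fold in one observed seek distance: mean = (7 * mean + sample) / 8. */
void model_seek_stats_update(struct model_seek_stats *s, uint64_t sdist)
{
    s->seek_mean = (7 * s->seek_mean + sdist) / 8;
}

bool model_is_seeky(const struct model_seek_stats *s)
{
    return s->seek_mean > MODEL_SEEKY_THRESHOLD;
}
```

A purely sequential task sheds the seeky label after a handful of requests in this model, while a task that really does seek never leaves it, so the pessimistic start only costs well-behaved tasks a short warm-up.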
* Re: IO scheduler based IO controller V10
From: Valdis.Kletnieks @ 2009-10-02 16:50 UTC
To: Vivek Goyal
Cc: dhaval, Corrado Zoccolo, dm-devel, Jens Axboe, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel, fchecconi, containers, Mike Galbraith, linux-kernel, akpm, righi.andrea, torvalds

On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:

> In that case, Corrado's suggestion of refining it further and disabling idling
> for seeky process only on non-rotational media (SSD and hardware RAID), makes
> sense to me.

Umm... I got petabytes of hardware RAID across the hall that very definitely
*is* rotating. Did you mean "SSD and disk systems with big honking caches
that cover up the rotation"? Because "RAID" and "big honking caches" are
not *quite* the same thing, and I can just see that corner case coming out
to bite somebody on the ass...
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-10-02 19:58 UTC
To: Valdis.Kletnieks
Cc: dhaval, Corrado Zoccolo, dm-devel, Jens Axboe, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel, fchecconi, containers, Mike Galbraith, linux-kernel, akpm, righi.andrea, torvalds

On Fri, Oct 02, 2009 at 12:50:17PM -0400, Valdis.Kletnieks@vt.edu wrote:
> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:
>
> > In that case, Corrado's suggestion of refining it further and disabling idling
> > for seeky process only on non-rotational media (SSD and hardware RAID), makes
> > sense to me.
>
> Umm... I got petabytes of hardware RAID across the hall that very definitely
> *is* rotating. Did you mean "SSD and disk systems with big honking caches
> that cover up the rotation"? Because "RAID" and "big honking caches" are
> not *quite* the same thing, and I can just see that corner case coming out
> to bite somebody on the ass...
>

I guess both. For systems which have big caches that cover up the rotation,
we probably need not idle for seeky processes. And in the case of big hardware
RAID, with multiple rotating disks, instead of idling and keeping the rest
of the disks free, we are probably better off dispatching requests from the
next queue (hoping it is going to a different disk altogether).

Thanks
Vivek
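The device cases discussed so far can be summarised in a small userspace sketch of the refined condition Corrado proposed earlier in the thread. The names model_device and model_idle_window_ms are hypothetical. The 8 ms and 2 ms values are the slice_idle default and the short seeky window (CFQ_MIN_TT) quoted in the thread. The think-time check and anything done in cfq_arm_slice_timer are deliberately left out, and a hardware RAID only takes the "no idling" path here if its queue has been flagged non-rotational, which is exactly the point under debate.

```c
#include <stdbool.h>

#define MODEL_SLICE_IDLE_MS 8u /* slice_idle default quoted in the thread */
#define MODEL_MIN_TT_MS     2u /* short idle window for seeky queues */

/* Hypothetical summary of what the scheduler knows about a device. */
struct model_device {
    bool nonrot; /* queue flagged non-rotational (SSD, or RAID if so marked) */
    bool hw_tag; /* device accepts several commands at once (NCQ/TCQ) */
};

/* Returns the idle window in milliseconds for a queue on this device.
 * 0 means: do not idle, dispatch from the next queue instead. */
unsigned int model_idle_window_ms(const struct model_device *dev, bool seeky)
{
    if (dev->nonrot && dev->hw_tag && seeky)
        return 0;

    return seeky ? MODEL_MIN_TT_MS : MODEL_SLICE_IDLE_MS;
}
```

Under this model a single NCQ SATA disk still waits 2 ms for a seeky queue, whereas an NCQ SSD, or a RAID marked non-rotational, moves straight on to the next queue.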
* Re: IO scheduler based IO controller V10
From: Corrado Zoccolo @ 2009-10-02 22:14 UTC
To: Vivek Goyal
Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Fri, Oct 2, 2009 at 9:58 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Fri, Oct 02, 2009 at 12:50:17PM -0400, Valdis.Kletnieks@vt.edu wrote:
>> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:
>>
>> Umm... I got petabytes of hardware RAID across the hall that very definitely
>> *is* rotating. Did you mean "SSD and disk systems with big honking caches
>> that cover up the rotation"? Because "RAID" and "big honking caches" are
>> not *quite* the same thing, and I can just see that corner case coming out
>> to bite somebody on the ass...
>>
>
> I guess both. For systems which have big caches that cover up the rotation,
> we probably need not idle for seeky processes. And in the case of big hardware
> RAID, with multiple rotating disks, instead of idling and keeping the rest
> of the disks free, we are probably better off dispatching requests from the
> next queue (hoping it is going to a different disk altogether).

In fact I think that the 'rotating' flag name is misleading.
All the checks we are doing are actually checking whether the device truly
supports multiple parallel operations, and this feature is shared by
hardware RAIDs and NCQ-enabled SSDs, but not by cheap SSDs or a single
NCQ-enabled SATA disk.

If we really wanted a "seek is cheap" flag, we could measure seek time
in the io-scheduler itself, but in the current code base the flag is not
used with this meaning anywhere.

Thanks,
Corrado

>
> Thanks
> Vivek
>

-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda
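Corrado's remark about measuring seek time inside the io-scheduler could look roughly like the following userspace sketch. It is purely illustrative: no such estimator appears in the thread, and all names, the near/far cutoff, the warm-up count and the "less than twice as slow" rule are assumptions. It also ignores the fact that service times overlap on devices that handle several commands in parallel.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_FAR_SECTORS (8u * 1024u) /* assumed near/far cutoff, sectors */
#define MODEL_MIN_SAMPLES 32u          /* assumed warm-up per class */

/* Hypothetical accumulator; zero-initialise before first use. */
struct model_seek_cost {
    uint64_t near_ns;  /* summed service time of short-distance requests */
    uint64_t far_ns;   /* summed service time of long-distance requests */
    uint32_t near_cnt; /* number of short-distance completions */
    uint32_t far_cnt;  /* number of long-distance completions */
};

/* Record one completed request: its seek distance and service time. */
void model_seek_cost_account(struct model_seek_cost *m,
                             uint64_t sdist, uint64_t service_ns)
{
    if (sdist > MODEL_FAR_SECTORS) {
        m->far_ns += service_ns;
        m->far_cnt++;
    } else {
        m->near_ns += service_ns;
        m->near_cnt++;
    }
}

/* "Seek is cheap" when far requests take, on average, less than twice as
 * long as near ones. Returns false until both classes are warmed up. */
bool model_seek_is_cheap(const struct model_seek_cost *m)
{
    if (m->near_cnt < MODEL_MIN_SAMPLES || m->far_cnt < MODEL_MIN_SAMPLES)
        return false;

    /* far_mean < 2 * near_mean, cross-multiplied to avoid division;
     * a long-running version would need decaying sums to avoid overflow. */
    return m->far_ns * m->near_cnt < 2 * m->near_ns * m->far_cnt;
}
```

A flag derived this way describes the cost of seeking directly, instead of inferring it from whether the device is labelled rotational.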
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-10-02 22:27 UTC
  To: Corrado Zoccolo
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
      Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
      lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
      dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
      jmarchan, torvalds, riel

On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> On Fri, Oct 2, 2009 at 9:58 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Fri, Oct 02, 2009 at 12:50:17PM -0400, Valdis.Kletnieks@vt.edu wrote:
> >> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said:
> >>
> >> Umm... I got petabytes of hardware RAID across the hall that very definitely
> >> *is* rotating. Did you mean "SSD and disk systems with big honking caches
> >> that cover up the rotation"? Because "RAID" and "big honking caches" are
> >> not *quite* the same thing, and I can just see that corner case coming out
> >> to bite somebody on the ass...
> >>
> >
> > I guess both. The systems which have big caches and cover up for rotation,
> > we probably need not idle for seeky process. And in case of big hardware
> > RAID, having multiple rotating disks, instead of idling and keeping rest
> > of the disks free, we probably are better off dispatching requests from
> > next queue (hoping it is going to a different disk altogether).
>
> In fact I think that the 'rotating' flag name is misleading.
> All the checks we are doing are actually checking if the device truly
> supports multiple parallel operations, and this feature is shared by
> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
> NCQ-enabled SATA disk.
>

While we are at it, what happens to notion of priority of tasks on SSDs?
Without idling there is not continuous time slice and there is no
fairness. So ioprio is out of the window for SSDs?

On SSDs, will it make more sense to provide fairness in terms of number of
IO or size of IO and not in terms of time slices?

Thanks
Vivek

> If we really wanted a "seek is cheap" flag, we could measure seek time
> in the io-scheduler itself, but in the current code base we don't have
> it used in this meaning anywhere.
>
> Thanks,
> Corrado
>
> >
> > Thanks
> > Vivek
> >
>
> --
> __________________________________________________________________________
>
> dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------
> The self-confidence of a warrior is not the self-confidence of the average
> man. The average man seeks certainty in the eyes of the onlooker and calls
> that self-confidence. The warrior seeks impeccability in his own eyes and
> calls that humbleness.
>                                Tales of Power - C. Castaneda
* Re: IO scheduler based IO controller V10
From: Corrado Zoccolo @ 2009-10-03 12:43 UTC
  To: Vivek Goyal
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
      Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
      lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
      dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
      jmarchan, torvalds, riel

On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
>> In fact I think that the 'rotating' flag name is misleading.
>> All the checks we are doing are actually checking if the device truly
>> supports multiple parallel operations, and this feature is shared by
>> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
>> NCQ-enabled SATA disk.
>>
>
> While we are at it, what happens to notion of priority of tasks on SSDs?
This is not changed by proposed patch w.r.t. current CFQ.

> Without idling there is not continuous time slice and there is no
> fairness. So ioprio is out of the window for SSDs?
I haven't NCQ enabled SSDs here, so I can't test it, but it seems to
me that the way in which queues are sorted in the rr tree may still
provide some sort of fairness and service differentiation for
priorities, in terms of number of IOs.
Non-NCQ SSDs, instead, will still have the idle window enabled, so it
is not an issue for them.

>
> On SSDs, will it make more sense to provide fairness in terms of number of
> IO or size of IO and not in terms of time slices.
Not on all SSDs. There are still ones that have a non-negligible
penalty on non-sequential access pattern (hopefully the ones without
NCQ, but if we find otherwise, then we will have to benchmark access
time in I/O scheduler to select the best policy). For those, time
based may still be needed.

Thanks,
Corrado

>
> Thanks
> Vivek
* Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
From: Vivek Goyal @ 2009-10-03 13:38 UTC
  To: Corrado Zoccolo
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
      Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
      lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
      dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
      jmarchan, torvalds, riel

On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> >> In fact I think that the 'rotating' flag name is misleading.
> >> All the checks we are doing are actually checking if the device truly
> >> supports multiple parallel operations, and this feature is shared by
> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
> >> NCQ-enabled SATA disk.
> >>
> >
> > While we are at it, what happens to notion of priority of tasks on SSDs?
> This is not changed by proposed patch w.r.t. current CFQ.

This is a general question irrespective of current patch. Want to know
what is our statement w.r.t ioprio and what it means for user? When do
we support it and when do we not.

> > Without idling there is not continuous time slice and there is no
> > fairness. So ioprio is out of the window for SSDs?
> I haven't NCQ enabled SSDs here, so I can't test it, but it seems to
> me that the way in which queues are sorted in the rr tree may still
> provide some sort of fairness and service differentiation for
> priorities, in terms of number of IOs.

I have a NCQ enabled SSD. Sometimes I see the difference, sometimes I do
not. I guess this happens because sometimes idling is enabled and sometimes
not because of dynamic nature of hw_tag.

I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
third prio7.

(prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
(prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
(prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec

Note there is almost no difference between prio 0 and prio 4 job and prio7
job has been penalized heavily (gets less than 10% BW of prio 4 job).

> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
> is not an issue for them.

Agree.

> >
> > On SSDs, will it make more sense to provide fairness in terms of number of
> > IO or size of IO and not in terms of time slices.
> Not on all SSDs. There are still ones that have a non-negligible
> penalty on non-sequential access pattern (hopefully the ones without
> NCQ, but if we find otherwise, then we will have to benchmark access
> time in I/O scheduler to select the best policy). For those, time
> based may still be needed.

Ok.

So on better SSDs out there with NCQ, we probably don't support the notion of
ioprio? Or, I am missing something.

Thanks
Vivek
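The "less than 10%" observation in the message above can be checked with a few lines of arithmetic. The following is only a sketch: the three bandwidth figures are copied from the fio output quoted above, while the constant names and the `bw_ratio` helper are hypothetical and exist purely for illustration.

```c
#include <assert.h>

/* Bandwidths reported by fio in the message above, in KiB/s.
 * The prio 0 figure (100 MiB/s) is converted with 1 MiB = 1024 KiB. */
static const double BW_PRIO0_KIB = 100.0 * 1024.0; /* prio 0: 100 MiB/s    */
static const double BW_PRIO4_KIB = 99950.0;        /* prio 4: 99,950 KiB/s */
static const double BW_PRIO7_KIB = 7594.0;         /* prio 7: 7,594 KiB/s  */

/* Ratio of one bandwidth to a reference bandwidth (hypothetical helper).
 * The reference must be positive. */
static double bw_ratio(double bw, double reference)
{
    assert(reference > 0.0);
    return bw / reference;
}
```

With these figures, `bw_ratio(BW_PRIO7_KIB, BW_PRIO4_KIB)` is about 0.076 (7594 / 99950), and `bw_ratio(BW_PRIO4_KIB, BW_PRIO0_KIB)` is about 0.976, which matches the description: prio 0 and prio 4 are nearly indistinguishable, while prio 7 gets under a tenth of the prio 4 bandwidth.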
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
From: Corrado Zoccolo @ 2009-10-04 9:15 UTC
  To: Vivek Goyal
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
      Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
      lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
      dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
      jmarchan, torvalds, riel

Hi Vivek,
On Sat, Oct 3, 2009 at 3:38 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
>> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
>> >> In fact I think that the 'rotating' flag name is misleading.
>> >> All the checks we are doing are actually checking if the device truly
>> >> supports multiple parallel operations, and this feature is shared by
>> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
>> >> NCQ-enabled SATA disk.
>> >>
>> >
>> > While we are at it, what happens to notion of priority of tasks on SSDs?
>> This is not changed by proposed patch w.r.t. current CFQ.
>
> This is a general question irrespective of current patch. Want to know
> what is our statement w.r.t ioprio and what it means for user? When do
> we support it and when do we not.
>
>> > Without idling there is not continuous time slice and there is no
>> > fairness. So ioprio is out of the window for SSDs?
>> I haven't NCQ enabled SSDs here, so I can't test it, but it seems to
>> me that the way in which queues are sorted in the rr tree may still
>> provide some sort of fairness and service differentiation for
>> priorities, in terms of number of IOs.
>
> I have a NCQ enabled SSD. Sometimes I see the difference sometimes I do
> not. I guess this happens because sometimes idling is enabled and sometimes
> not because of dynamic nature of hw_tag.
>
My guess is that the formula that is used to handle this case is not
very stable.
The culprit code is (in cfq_service_tree_add):

        } else if (!add_front) {
                rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
                rb_key += cfqq->slice_resid;
                cfqq->slice_resid = 0;
        } else

cfq_slice_offset is defined as:

static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
                                      struct cfq_queue *cfqq)
{
        /*
         * just an approximation, should be ok.
         */
        return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
                       cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
}

Can you try changing the latter to a simpler one (we already observed that
busy_queues is unstable, and I think that here it is not needed at
all):

        return -cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);

and remove the 'rb_key += cfqq->slice_resid;' from the former.

This should give a higher probability of being first on the queue to
larger slice tasks, so it will work if we don't idle, but it needs
some adjustment if we idle.

> I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
> third prio7.
>
> (prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
> (prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
> (prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec
>
> Note there is almost no difference between prio 0 and prio 4 job and prio7
> job has been penalized heavily (gets less than 10% BW of prio 4 job).
>
>> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
>> is not an issue for them.
>
> Agree.
>
>> >
>> > On SSDs, will it make more sense to provide fairness in terms of number of
>> > IO or size of IO and not in terms of time slices.
>> Not on all SSDs. There are still ones that have a non-negligible
>> penalty on non-sequential access pattern (hopefully the ones without
>> NCQ, but if we find otherwise, then we will have to benchmark access
>> time in I/O scheduler to select the best policy). For those, time
>> based may still be needed.
>
> Ok.
>
> So on better SSDs out there with NCQ, we probably don't support the notion of
> ioprio? Or, I am missing something.

I think we try, but the current formula is simply not good enough.

Thanks,
Corrado

>
> Thanks
> Vivek
>

--
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda
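To see what the two offset formulas in the message above produce, they can be modelled in a small userspace sketch. This is not kernel code and makes several assumptions: `prio_slice` is a simplified stand-in for `cfq_prio_slice` for sync queues, with an assumed 100 ms base slice and a scale divisor of 5 (believed to match CFQ defaults of that era, but not confirmed by the thread); time is counted in milliseconds rather than jiffies; and the names `offset_current` and `offset_proposed` are hypothetical. The kernel function returns `unsigned long`; a signed `long` is used here so the negative proposed offset is easy to read.

```c
#include <assert.h>

#define IOPRIO_LEVELS  8    /* best-effort priorities 0..7               */
#define BASE_SLICE_MS  100  /* ASSUMED base slice for sync queues, in ms */
#define SLICE_SCALE    5    /* ASSUMED scaling divisor                   */

/* ASSUMED model of cfq_prio_slice() for sync queues:
 * a lower prio number (higher priority) gets a longer slice. */
static int prio_slice(int prio)
{
    assert(prio >= 0 && prio < IOPRIO_LEVELS);
    return BASE_SLICE_MS + (BASE_SLICE_MS / SLICE_SCALE) * (4 - prio);
}

/* Current formula from the thread: the offset scales with the number
 * of busy queues, so it changes whenever busy_queues changes. */
static long offset_current(int busy_queues, int prio)
{
    assert(busy_queues >= 1);
    return (long)(busy_queues - 1) * (prio_slice(0) - prio_slice(prio));
}

/* Proposed formula from the thread: depends only on the queue's own
 * slice; a larger slice gives a smaller (earlier) key. */
static long offset_proposed(int prio)
{
    return -(long)prio_slice(prio);
}
```

Under these assumptions the slices for prio 0, 4 and 7 are 180, 100 and 40 ms. `offset_current(3, 7)` is 280 while `offset_current(2, 7)` is 140, so the same queue lands in a different place depending on how many queues happen to be busy. `offset_proposed` gives -180, -100 and -40 for the three priorities regardless of `busy_queues`.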
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
From: Vivek Goyal @ 2009-10-04 12:11 UTC
  To: Corrado Zoccolo
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
      Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
      lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
      dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
      jmarchan, torvalds, riel

On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
> Hi Vivek,
> On Sat, Oct 3, 2009 at 3:38 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote:
> >> On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote:
> >> >> In fact I think that the 'rotating' flag name is misleading.
> >> >> All the checks we are doing are actually checking if the device truly
> >> >> supports multiple parallel operations, and this feature is shared by
> >> >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single
> >> >> NCQ-enabled SATA disk.
> >> >>
> >> >
> >> > While we are at it, what happens to notion of priority of tasks on SSDs?
> >> This is not changed by proposed patch w.r.t. current CFQ.
> >
> > This is a general question irrespective of current patch. Want to know
> > what is our statement w.r.t ioprio and what it means for user? When do
> > we support it and when do we not.
> >
> >> > Without idling there is not continuous time slice and there is no
> >> > fairness. So ioprio is out of the window for SSDs?
> >> I haven't NCQ enabled SSDs here, so I can't test it, but it seems to
> >> me that the way in which queues are sorted in the rr tree may still
> >> provide some sort of fairness and service differentiation for
> >> priorities, in terms of number of IOs.
> >
> > I have a NCQ enabled SSD. Sometimes I see the difference sometimes I do
> > not. I guess this happens because sometimes idling is enabled and sometimes
> > not because of dynamic nature of hw_tag.
> >
> My guess is that the formula that is used to handle this case is not
> very stable.

In general I agree that formula to calculate the slice offset is very
puzzling as busy_queues varies and that changes the position of the task
sometimes.

> The culprit code is (in cfq_service_tree_add):
>         } else if (!add_front) {
>                 rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
>                 rb_key += cfqq->slice_resid;
>                 cfqq->slice_resid = 0;
>         } else
>
> cfq_slice_offset is defined as:
>
> static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
>                                       struct cfq_queue *cfqq)
> {
>         /*
>          * just an approximation, should be ok.
>          */
>         return (cfqd->busy_queues - 1) * (cfq_prio_slice(cfqd, 1, 0) -
>                        cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio));
> }
>
> Can you try changing the latter to a simpler one (we already observed that
> busy_queues is unstable, and I think that here it is not needed at
> all):
>         return -cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
> and remove the 'rb_key += cfqq->slice_resid;' from the former.
>
> This should give a higher probability of being first on the queue to
> larger slice tasks, so it will work if we don't idle, but it needs
> some adjustment if we idle.

I am not sure what's the intent here by removing busy_queues stuff. I have
got two questions though.

- Why don't we keep it simple round robin where a task is simply placed at
  the end of service tree.

- Secondly, CFQ provides full slice length to queues only which are
  idling (in case of sequential reader). If we do not enable idling, as
  in case of NCQ enabled SSDs, then CFQ will expire the queue almost
  immediately and put the queue at the end of service tree (almost).

So if we don't enable idling, at max we can provide fairness, we
essentially just let every queue dispatch one request and put it at the
end of service tree. Hence no fairness....

Thanks
Vivek

>
> > I ran three fio reads for 10 seconds. First job is prio0, second prio4 and
> > third prio7.
> >
> > (prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
> > (prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
> > (prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec
> >
> > Note there is almost no difference between prio 0 and prio 4 job and prio7
> > job has been penalized heavily (gets less than 10% BW of prio 4 job).
> >
> >> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
> >> is not an issue for them.
> >
> > Agree.
> >
> >> >
> >> > On SSDs, will it make more sense to provide fairness in terms of number of
> >> > IO or size of IO and not in terms of time slices.
> >> Not on all SSDs. There are still ones that have a non-negligible
> >> penalty on non-sequential access pattern (hopefully the ones without
> >> NCQ, but if we find otherwise, then we will have to benchmark access
> >> time in I/O scheduler to select the best policy). For those, time
> >> based may still be needed.
> >
> > Ok.
> >
> > So on better SSDs out there with NCQ, we probably don't support the notion of
> > ioprio? Or, I am missing something.
>
> I think we try, but the current formula is simply not good enough.
>
> Thanks,
> Corrado
>
> >
> > Thanks
> > Vivek
> >
>
> --
> __________________________________________________________________________
>
> dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
> PhD - Department of Computer Science - University of Pisa, Italy
> --------------------------------------------------------------------------
> The self-confidence of a warrior is not the self-confidence of the average
> man. The average man seeks certainty in the eyes of the onlooker and calls
> that self-confidence. The warrior seeks impeccability in his own eyes and
> calls that humbleness.
>                                Tales of Power - C. Castaneda
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10)
From: Corrado Zoccolo @ 2009-10-04 12:46 UTC
  To: Vivek Goyal
  Cc: Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar,
      Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
      lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
      dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
      jmarchan, torvalds, riel

Hi Vivek,
On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote:
>> Hi Vivek,
>> My guess is that the formula that is used to handle this case is not
>> very stable.
>
> In general I agree that formula to calculate the slice offset is very
> puzzling as busy_queues varies and that changes the position of the task
> sometimes.
>
> I am not sure what's the intent here by removing busy_queues stuff. I have
> got two questions though.

In the ideal case steady state, busy_queues will be a constant. Since
we are just comparing the values between themselves, we can just
remove this constant completely.

Whenever it is not constant, it seems to me that it can cause wrong
behaviour, i.e. when the number of processes with ready I/O reduces, a
later coming request can jump before older requests.
So it seems it does more harm than good, hence I suggest to remove it.

Moreover, I suggest removing also the slice_resid part, since its
semantics doesn't seem consistent.
When computed, it is not the residency, but the remaining time slice.
Then it is used to postpone, instead of anticipate, the position of
the queue in the RR, that seems counterintuitive (it would be
intuitive, though, if it was actually a residency, not a remaining
slice, i.e. you already got your full share, so you can wait longer to
be serviced again).

>
> - Why don't we keep it simple round robin where a task is simply placed at
>   the end of service tree.

This should work for the idling case, since we provide service
differentiation by means of time slice.
For non-idling case, though, the appropriate placement of queues in
the tree (as given by my formula) can still provide it.

>
> - Secondly, CFQ provides full slice length to queues only which are
>   idling (in case of sequential reader). If we do not enable idling, as
>   in case of NCQ enabled SSDs, then CFQ will expire the queue almost
>   immediately and put the queue at the end of service tree (almost).
>
> So if we don't enable idling, at max we can provide fairness, we
> essentially just let every queue dispatch one request and put it at the
> end of service tree. Hence no fairness....

We should distinguish the two terms fairness and service
differentiation. Fairness is when every queue gets the same amount of
service share. This is not what we want when priorities are different
(we want the service differentiation, instead), but is what we get if
we do just round robin without idling.

To fix this, we can alter the placement in the tree, so that if we
have Q1 with slice S1, and Q2 with slice S2, always ready to perform
I/O, we get that Q1 is in front of the tree with probability
S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2).
This is what my formula should achieve.

Thanks,
Corrado

>
> Thanks
> Vivek
>
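Two points from the message above lend themselves to a small worked example: the reordering that a changing busy_queues can cause, and the target front-of-tree share S1/(S1+S2). The sketch below is a userspace model, not kernel code. It assumes the same simplified slice model as CFQ's sync default is believed to use (100 ms base, divisor 5), counts time in milliseconds, and leaves `slice_resid` out. The function names `key_current`, `key_proposed` and `target_front_share` are hypothetical, and the last one only computes the stated target ratio; it does not show that the proposed key actually achieves it.

```c
#include <assert.h>

#define IOPRIO_LEVELS  8    /* best-effort priorities 0..7               */
#define BASE_SLICE_MS  100  /* ASSUMED base slice for sync queues, in ms */
#define SLICE_SCALE    5    /* ASSUMED scaling divisor                   */

/* ASSUMED model of cfq_prio_slice() for sync queues. */
static int prio_slice(int prio)
{
    assert(prio >= 0 && prio < IOPRIO_LEVELS);
    return BASE_SLICE_MS + (BASE_SLICE_MS / SLICE_SCALE) * (4 - prio);
}

/* Service-tree key under the current formula (slice_resid omitted):
 * enqueue time plus an offset that scales with busy_queues. */
static long key_current(long now, int busy_queues, int prio)
{
    assert(busy_queues >= 1);
    return now + (long)(busy_queues - 1) * (prio_slice(0) - prio_slice(prio));
}

/* Service-tree key under the proposed formula: enqueue time minus the
 * queue's own slice, independent of busy_queues. */
static long key_proposed(long now, int prio)
{
    return now - prio_slice(prio);
}

/* Target share S1 / (S1 + S2) for two always-ready queues, where S1 and
 * S2 are the slices of prio_a and prio_b. This is the stated goal only. */
static double target_front_share(int prio_a, int prio_b)
{
    double sa = (double)prio_slice(prio_a);
    double sb = (double)prio_slice(prio_b);
    return sa / (sa + sb);
}
```

Take two prio 7 queues. Queue A is inserted at t=1000 while three queues are busy, giving `key_current(1000, 3, 7)` = 1280. Queue B is inserted 100 ms later, when only two queues are busy, giving `key_current(1100, 2, 7)` = 1240. B sorts ahead of A even though it arrived later, which is the jump described above. With `key_proposed` the keys are 960 and 1060, so arrival order is preserved for equal priorities. For a prio 0 queue against a prio 7 queue, `target_front_share(0, 7)` is 180 / 220, roughly 0.82.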
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-04 12:46 ` Corrado Zoccolo @ 2009-10-04 16:20 ` Fabio Checconi [not found] ` <20091004162005.GH4650-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org> 2009-10-05 21:21 ` Corrado Zoccolo [not found] ` <4e5e476b0910040546h5f77cd1fo3172fe5c229eb579-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-06 21:36 ` Vivek Goyal 2 siblings, 2 replies; 349+ messages in thread From: Fabio Checconi @ 2009-10-04 16:20 UTC (permalink / raw) To: Corrado Zoccolo Cc: Vivek Goyal, Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel > From: Corrado Zoccolo <czoccolo@gmail.com> > Date: Sun, Oct 04, 2009 02:46:44PM +0200 > > Hi Vivek, > On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote: > >> Hi Vivek, > >> My guess is that the formula that is used to handle this case is not > >> very stable. > > > > In general I agree that formula to calculate the slice offset is very > > puzzling as busy_queues varies and that changes the position of the task > > sometimes. > > > > I am not sure what's the intent here by removing busy_queues stuff. I have > > got two questions though. > > In the ideal case steady state, busy_queues will be a constant. Since > we are just comparing the values between themselves, we can just > remove this constant completely. > > Whenever it is not constant, it seems to me that it can cause wrong > behaviour, i.e. when the number of processes with ready I/O reduces, a > later coming request can jump before older requests. > So it seems it does more harm than good, hence I suggest to remove it. 
> > Moreover, I suggest removing also the slice_resid part, since its > semantics doesn't seem consistent. > When computed, it is not the residency, but the remaining time slice. > Then it is used to postpone, instead of anticipate, the position of > the queue in the RR, that seems counterintuitive (it would be > intuitive, though, if it was actually a residency, not a remaining > slice, i.e. you already got your full share, so you can wait longer to > be serviced again). > > > > > - Why don't we keep it simple round robin where a task is simply placed at > > the end of service tree. > > This should work for the idling case, since we provide service > differentiation by means of time slice. > For non-idling case, though, the appropriate placement of queues in > the tree (as given by my formula) can still provide it. > > > > > - Secondly, CFQ provides full slice length to queues only which are > > idling (in case of sequenatial reader). If we do not enable idling, as > > in case of NCQ enabled SSDs, then CFQ will expire the queue almost > > immediately and put the queue at the end of service tree (almost). > > > > So if we don't enable idling, at max we can provide fairness, we > > esseitially just let every queue dispatch one request and put at the end > > of the end of service tree. Hence no fairness.... > > We should distinguish the two terms fairness and service > differentiation. Fairness is when every queue gets the same amount of > service share. This is not what we want when priorities are different > (we want the service differentiation, instead), but is what we get if > we do just round robin without idling. > > To fix this, we can alter the placement in the tree, so that if we > have Q1 with slice S1, and Q2 with slice S2, always ready to perform > I/O, we get that Q1 is in front of the three with probability > S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2). > This is what my formula should achieve. 
> But if the ``always ready to perform I/O'' assumption held then even RR would have provided service differentiation, always seeing backlogged queues and serving them according to their weights. In this case the problem is what Vivek described some time ago as the interlocked service of sync queues, where the scheduler is trying to differentiate between the queues, but they are not always asking for service (as they are synchronous and they are backlogged only for short time intervals). ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091004162005.GH4650-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org>]
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) [not found] ` <20091004162005.GH4650-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org> @ 2009-10-05 21:21 ` Corrado Zoccolo 0 siblings, 0 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-10-05 21:21 UTC (permalink / raw) To: Fabio Checconi Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, Valdis.Kletnieks-PjAqaU27lzQ, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Sun, Oct 4, 2009 at 6:20 PM, Fabio Checconi <fchecconi-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote: > But if the ``always ready to perform I/O'' assumption held then even RR > would have provided service differentiation, always seeing backlogged > queues and serving them according to their weights. Right, this property is too strong. But also a weaker "the two queues have think times less than the disk access time" will be enough to achieve the same goal by means of proper placement in the RR tree. If both think times are greater than access time, then each queue will get a service level equivalent to it being the only queue in the system, so in this case service differentiation will not apply (do we need to differentiate when everyone gets exactly what he needs?). If one think time is less, and the other is more than the access time, then we should decide what kind of fairness we want to have, especially if the one with larger think time has also higher priority. 
> In this case the problem is what Vivek described some time ago as the > interlocked service of sync queues, where the scheduler is trying to > differentiate between the queues, but they are not always asking for > service (as they are synchronous and they are backlogged only for short > time intervals). Corrado ^ permalink raw reply [flat|nested] 349+ messages in thread
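The three cases in this reply can be written down as a small classifier. This is only a sketch of the reasoning, assuming think times and the disk access time are known in microseconds; the enum and function names are hypothetical, and treating a think time equal to the access time as "not less" is an assumption the message does not settle.

```c
/* Hypothetical names; this only restates the three cases from the message. */
enum think_regime {
    REGIME_PLACEMENT,     /* both think times below access time: placement in the tree can differentiate */
    REGIME_NO_CONTENTION, /* both above: each queue is served as if it were alone */
    REGIME_POLICY_CHOICE  /* one below, one above: the kind of fairness wanted must be decided */
};

static enum think_regime classify(unsigned int think1_us,
                                  unsigned int think2_us,
                                  unsigned int access_us)
{
    int fast1 = think1_us < access_us;  /* queue 1 is backlogged again before the disk is free */
    int fast2 = think2_us < access_us;

    if (fast1 && fast2)
        return REGIME_PLACEMENT;
    if (!fast1 && !fast2)
        return REGIME_NO_CONTENTION;
    return REGIME_POLICY_CHOICE;
}
```

For example, with the roughly 8 ms access time mentioned earlier in the thread for rotational SATA disks, two queues thinking for 1 ms and 2 ms fall in the first case, while 1 ms against 20 ms falls in the third.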
[parent not found: <4e5e476b0910040546h5f77cd1fo3172fe5c229eb579-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-04 12:46 ` Corrado Zoccolo @ 2009-10-05 15:06 ` Jeff Moyer [not found] ` <4e5e476b0910040546h5f77cd1fo3172fe5c229eb579-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-06 21:36 ` Vivek Goyal 2 siblings, 0 replies; 349+ messages in thread From: Jeff Moyer @ 2009-10-05 15:06 UTC (permalink / raw) To: Corrado Zoccolo Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Corrado Zoccolo <czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes: > Moreover, I suggest removing also the slice_resid part, since its > semantics doesn't seem consistent. > When computed, it is not the residency, but the remaining time slice. It stands for residual, not residency. Make more sense? Cheers, Jeff ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) 2009-10-05 15:06 ` Jeff Moyer @ 2009-10-05 21:09 ` Corrado Zoccolo -1 siblings, 0 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-10-05 21:09 UTC (permalink / raw) To: Jeff Moyer Cc: Vivek Goyal, Valdis.Kletnieks, Mike Galbraith, Jens Axboe, Ingo Molnar, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer@redhat.com> wrote: > Corrado Zoccolo <czoccolo@gmail.com> writes: > >> Moreover, I suggest removing also the slice_resid part, since its >> semantics doesn't seem consistent. >> When computed, it is not the residency, but the remaining time slice. > > It stands for residual, not residency. Make more sense? It makes sense when computed, but not when used in the rb_key computation. Why should we postpone queues that were preempted, instead of giving them a boost? Thanks, Corrado > > Cheers, > Jeff > ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <4e5e476b0910051409x33f8365flf32e8e7548d72e79-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) [not found] ` <4e5e476b0910051409x33f8365flf32e8e7548d72e79-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-10-06 8:41 ` Jens Axboe 0 siblings, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-06 8:41 UTC (permalink / raw) To: Corrado Zoccolo Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, Jeff Moyer, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Mon, Oct 05 2009, Corrado Zoccolo wrote: > On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > Corrado Zoccolo <czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes: > > > >> Moreover, I suggest removing also the slice_resid part, since its > >> semantics doesn't seem consistent. > >> When computed, it is not the residency, but the remaining time slice. > > > > It stands for residual, not residency. Make more sense? > It makes sense when computed, but not when used in rb_key computation. > Why should we postpone queues that were preempted, instead of giving > them a boost? We should not. If it is/was working correctly, it should allow for both an increase and a decrease of the tree position (hence it's a long and can go negative), to account for both over and under time. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091006084120.GJ5216-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>]
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) [not found] ` <20091006084120.GJ5216-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> @ 2009-10-06 9:00 ` Corrado Zoccolo 0 siblings, 0 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-10-06 9:00 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, Jeff Moyer, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Tue, Oct 6, 2009 at 10:41 AM, Jens Axboe <jens.axboe@oracle.com> wrote: > On Mon, Oct 05 2009, Corrado Zoccolo wrote: >> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer@redhat.com> wrote: >> > It stands for residual, not residency. Make more sense? >> It makes sense when computed, but not when used in rb_key computation. >> Why should we postpone queues that were preempted, instead of giving >> them a boost? > > We should not, if it is/was working correctly, it should allow both for > increase/decrease of tree position (hence it's a long and can go > negative) to account for both over and under time. I'm doing some tests with and without it. This is how it works now: definition: if (timed_out && !cfq_cfqq_slice_new(cfqq)) { cfqq->slice_resid = cfqq->slice_end - jiffies; cfq_log_cfqq(cfqd, cfqq, "resid=%ld", cfqq->slice_resid); } * here resid is > 0 if there was residual time, and < 0 if the queue overran its slice. 
use: rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; rb_key += cfqq->slice_resid; cfqq->slice_resid = 0; * here if the residual is > 0, we postpone the queue, i.e. penalize it. If the residual is < 0 (i.e. the queue overran its slice), we bring it forward, i.e. we boost it. So this is likely not what we want. I did some tests with and without it, and with the sign changed, and it doesn't matter at all for pure sync workloads. The only case in which it matters a little, from my experiments, is for a sync vs async workload. Here, since async queues are preempted, the current form of the code penalizes them, so they get larger delays, and we get more bandwidth for sync. This is, btw, the only positive outcome (I can think of) from the current form of the code, and I think we could obtain it more easily by unconditionally adding a delay for async queues: rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; if (!cfq_cfqq_sync(cfqq)) { rb_key += CFQ_ASYNC_DELAY; } and removing the resid stuff completely (or at least leaving us with the ability to use it with the proper sign). Corrado > > -- > Jens Axboe > > ^ permalink raw reply [flat|nested] 349+ messages in thread
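The proposed alternative (a fixed delay for async queues instead of the residual) can be modelled outside the kernel as below. This is a sketch under assumptions: `slice_offset` stands in for the result of cfq_slice_offset(), `now` for jiffies, and `is_sync` for cfq_cfqq_sync(). The message names CFQ_ASYNC_DELAY but gives no value, so the value used here is made up for illustration.

```c
/* Hypothetical value in jiffies; the thread does not say what it should be. */
#define CFQ_ASYNC_DELAY 50UL

/*
 * Model of the proposed key computation: every queue is placed at
 * slice_offset + now, and async queues are pushed back by a fixed delay.
 * No residual term is used.
 */
static unsigned long rb_key_async_delay(unsigned long slice_offset,
                                        unsigned long now,
                                        int is_sync)
{
    unsigned long rb_key = slice_offset + now;

    if (!is_sync)
        rb_key += CFQ_ASYNC_DELAY;  /* async queues always sort later */

    return rb_key;
}
```

With equal offsets, an async queue always sorts exactly CFQ_ASYNC_DELAY later than a sync one, regardless of whether it was preempted in its previous slice.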
[parent not found: <4e5e476b0910060200i7c028b3fr4c235bf5f18c3aa1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) [not found] ` <4e5e476b0910060200i7c028b3fr4c235bf5f18c3aa1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-10-06 18:53 ` Jens Axboe 0 siblings, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-06 18:53 UTC (permalink / raw) To: Corrado Zoccolo Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, Jeff Moyer, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Tue, Oct 06 2009, Corrado Zoccolo wrote: > On Tue, Oct 6, 2009 at 10:41 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > On Mon, Oct 05 2009, Corrado Zoccolo wrote: > >> On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > >> > It stands for residual, not residency. Make more sense? > >> It makes sense when computed, but not when used in rb_key computation. > >> Why should we postpone queues that where preempted, instead of giving > >> them a boost? > > > > We should not, if it is/was working correctly, it should allow both for > > increase/descrease of tree position (hence it's a long and can go > > negative) to account for both over and under time. > > I'm doing some tests with and without it. 
> How it is working now is: > definition: > if (timed_out && !cfq_cfqq_slice_new(cfqq)) { > cfqq->slice_resid = cfqq->slice_end - jiffies; > cfq_log_cfqq(cfqd, cfqq, "resid=%ld", > cfqq->slice_resid); > } > * here resid is > 0 if there was residual time, and < 0 if the queue > overrun its slice. > use: > rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; > rb_key += cfqq->slice_resid; > cfqq->slice_resid = 0; > * here if residual is > 0, we postpone, i.e. penalize. If residual is > < 0 (i.e. the queue overrun), we anticipate it, i.e. we boost it. > > So this is likely not what we want. Indeed, that should be -= cfqq->slice_resid. > I did some tests with and without it, or changing the sign, and it > doesn't matter at all for pure sync workloads. For most cases it will not change things a lot, but it should be technically correct. > The only case in which it matters a little, from my experiments, is > for sync vs async workload. Here, since async queues are preempted, > the current form of the code penalizes them, so they get larger > delays, and we get more bandwidth for sync. Right > This is, btw, the only positive outcome (I can think of) from the > current form of the code, and I think we could obtain it more easily > by unconditionally adding a delay for async queues: > rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; > if (!cfq_cfqq_sync(cfqq)) { > rb_key += CFQ_ASYNC_DELAY; > } > > removing completely the resid stuff (or at least leaving us with the > ability of using it with the proper sign). It's more likely for the async queue to overrun, but it can happen for others as well. I'm keeping the residual count, but making the sign change of course. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
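The effect of the sign change agreed on here can be checked with a small model of the key computation. This is a sketch, not the kernel code: plain `long` is used for every value for simplicity, `slice_offset` stands in for cfq_slice_offset(), `now` for jiffies, and the `corrected` flag is a hypothetical switch between the original `+=` and the `-=` form.

```c
/*
 * Model of the rb_key computation with either sign for the residual.
 * slice_resid > 0 means the queue had time left when it was expired,
 * slice_resid < 0 means it overran its slice.
 */
static long rb_key_with_resid(long slice_offset, long now,
                              long slice_resid, int corrected)
{
    long rb_key = slice_offset + now;

    if (corrected)
        rb_key -= slice_resid;  /* leftover time moves the queue earlier, overrun moves it later */
    else
        rb_key += slice_resid;  /* original sign: leftover time moves the queue later */

    return rb_key;
}
```

For a queue preempted with 20 jiffies left, the original sign places it 20 jiffies later than its nominal position, while the corrected sign places it 20 jiffies earlier. A queue that overran by 20 jiffies gets the opposite treatment.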
[parent not found: <x49my457uef.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org>]
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) [not found] ` <x49my457uef.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org> @ 2009-10-05 21:09 ` Corrado Zoccolo 0 siblings, 0 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-10-05 21:09 UTC (permalink / raw) To: Jeff Moyer Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Mon, Oct 5, 2009 at 5:06 PM, Jeff Moyer <jmoyer@redhat.com> wrote: > Corrado Zoccolo <czoccolo@gmail.com> writes: > >> Moreover, I suggest removing also the slice_resid part, since its >> semantics doesn't seem consistent. >> When computed, it is not the residency, but the remaining time slice. > > It stands for residual, not residency. Make more sense? It makes sense when computed, but not when used in rb_key computation. Why should we postpone queues that where preempted, instead of giving them a boost? Thanks, Corrado > > Cheers, > Jeff > _______________________________________________ Containers mailing list Containers@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) [not found] ` <4e5e476b0910040546h5f77cd1fo3172fe5c229eb579-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-05 15:06 ` Jeff Moyer @ 2009-10-06 21:36 ` Vivek Goyal 1 sibling, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-06 21:36 UTC (permalink / raw) To: Corrado Zoccolo Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Sun, Oct 04, 2009 at 02:46:44PM +0200, Corrado Zoccolo wrote: > Hi Vivek, > On Sun, Oct 4, 2009 at 2:11 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > On Sun, Oct 04, 2009 at 11:15:24AM +0200, Corrado Zoccolo wrote: > >> Hi Vivek, > >> My guess is that the formula that is used to handle this case is not > >> very stable. > > > > In general I agree that formula to calculate the slice offset is very > > puzzling as busy_queues varies and that changes the position of the task > > sometimes. > > > > I am not sure what's the intent here by removing busy_queues stuff. I have > > got two questions though. > > In the ideal case steady state, busy_queues will be a constant. Since > we are just comparing the values between themselves, we can just > remove this constant completely. > > Whenever it is not constant, it seems to me that it can cause wrong > behaviour, i.e. 
when the number of processes with ready I/O reduces, a
> later coming request can jump before older requests.
> So it seems it does more harm than good, hence I suggest to remove it.
>

I agree here. busy_queues can vary, especially given the fact that CFQ
removes the queue from the service tree immediately after the dispatch, if
the queue is empty, and then it waits for request completion from the queue
and idles on the queue.

So consider the following scenario where two thinking readers and one
writer are executing. Readers preempt the writer, and the writer gets back
into the tree. When the writer gets backlogged, at that point of time
busy_queues=2, and when a reader gets backlogged, busy_queues=1 (most of
the time, because a reader is idling), and hence many a time readers get
placed ahead of the writer.

This is so subtle that I am not sure it was designed that way.

So dependence on busy_queues can change queue ordering in unpredictable
ways.

> Moreover, I suggest removing also the slice_resid part, since its
> semantics doesn't seem consistent.
> When computed, it is not the residency, but the remaining time slice.
> Then it is used to postpone, instead of anticipate, the position of
> the queue in the RR, that seems counterintuitive (it would be
> intuitive, though, if it was actually a residency, not a remaining
> slice, i.e. you already got your full share, so you can wait longer to
> be serviced again).
>
> >
> > - Why don't we keep it simple round robin where a task is simply placed at
> >   the end of service tree.
>
> This should work for the idling case, since we provide service
> differentiation by means of time slice.
> For non-idling case, though, the appropriate placement of queues in
> the tree (as given by my formula) can still provide it.
>

So for the non-idling case, you provide service differentiation by the
number of times a queue is scheduled to run, rather than by providing a
bigger slice to the queue?
This will work only to an extent, and depends on the size of IO being
dispatched from each queue. If some queues have a bigger request size and
some a smaller one (this can easily be driven by changing the block size),
then again you will not see fairness numbers. In that case it might make
sense to provide fairness in terms of size of IO/number of IO.

So to me it boils down to what the seek cost of the underlying media is.
If the seek cost is high, provide fairness in terms of time slice. If the
seek cost is really low, one can afford faster switching of queues without
losing too much on the throughput side, and in that case fairness in terms
of size of IO should be good. Now if on good SSDs with NCQ the seek cost is
low, I am wondering if it will make sense to tweak CFQ to change mode
dynamically and start providing fairness in terms of size of IO/number of
IO?

> >
> > - Secondly, CFQ provides full slice length to queues only which are
> >   idling (in case of sequential reader). If we do not enable idling, as
> >   in case of NCQ enabled SSDs, then CFQ will expire the queue almost
> >   immediately and put the queue at the end of service tree (almost).
> >
> > So if we don't enable idling, at max we can provide fairness, we
> > essentially just let every queue dispatch one request and put it at the
> > end of the service tree. Hence no fairness....
>
> We should distinguish the two terms fairness and service
> differentiation. Fairness is when every queue gets the same amount of
> service share.

Will it not be "proportionate amount of service share" instead of "same
amount of service share"?

> This is not what we want when priorities are different
> (we want the service differentiation, instead), but is what we get if
> we do just round robin without idling.
>
> To fix this, we can alter the placement in the tree, so that if we
> have Q1 with slice S1, and Q2 with slice S2, always ready to perform
> I/O, we get that Q1 is in front of the tree with probability
> S1/(S1+S2), and Q2 is in front with probability S2/(S1+S2).
> This is what my formula should achieve.

I have yet to get into the details, but as I said, this sounds like
fairness by frequency, or by the number of times a queue is scheduled to
dispatch. So it will help up to some extent on NCQ enabled SSDs, but it
will become unfair if the size of IO each queue dispatches is very
different.

Thanks
Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
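The objection above can be illustrated with a small worked calculation. This is only a sketch under simplifying assumptions that are not in the thread: each queue dispatches exactly one request per turn, request sizes are fixed per queue, and the function name `toy_byte_share_pct` is hypothetical. It shows that scheduling Q1 twice as often as Q2 gives Q1 roughly two thirds of the bytes when both use the same request size, but far less when Q2 issues much larger requests.

```c
/*
 * Hypothetical helper: percentage of total transferred data (truncated to
 * an integer) that queue A receives when A is scheduled turns_a times and
 * B turns_b times, each dispatching one request of the given size (in KiB)
 * per turn. Returns 0 if nothing was transferred.
 */
static unsigned long toy_byte_share_pct(unsigned long turns_a,
                                        unsigned long size_a_kib,
                                        unsigned long turns_b,
                                        unsigned long size_b_kib)
{
    unsigned long kib_a = turns_a * size_a_kib;
    unsigned long kib_b = turns_b * size_b_kib;
    unsigned long total = kib_a + kib_b;

    if (total == 0)
        return 0;
    return kib_a * 100UL / total;
}
```

For example, with a 2:1 scheduling frequency, equal 4 KiB requests give queue A 66% of the data, while 4 KiB requests against 64 KiB requests give it only 11%.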
[parent not found: <4e5e476b0910030543o776fb505ka0ce38da9d83b33c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) [not found] ` <4e5e476b0910030543o776fb505ka0ce38da9d83b33c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-10-03 13:38 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-03 13:38 UTC (permalink / raw) To: Corrado Zoccolo Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Sat, Oct 03, 2009 at 02:43:14PM +0200, Corrado Zoccolo wrote: > On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote: > >> In fact I think that the 'rotating' flag name is misleading. > >> All the checks we are doing are actually checking if the device truly > >> supports multiple parallel operations, and this feature is shared by > >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single > >> NCQ-enabled SATA disk. > >> > > > > While we are at it, what happens to notion of priority of tasks on SSDs? > This is not changed by proposed patch w.r.t. current CFQ. This is a general question irrespective of current patch. Want to know what is our statement w.r.t ioprio and what it means for user? When do we support it and when do we not. > > Without idling there is not continuous time slice and there is no > > fairness. So ioprio is out of the window for SSDs? 
> I haven't NCQ enabled SSDs here, so I can't test it, but it seems to
> me that the way in which queues are sorted in the rr tree may still
> provide some sort of fairness and service differentiation for
> priorities, in terms of number of IOs.

I have an NCQ enabled SSD. Sometimes I see the difference, sometimes I do
not. I guess this happens because sometimes idling is enabled and sometimes
not, because of the dynamic nature of hw_tag.

I ran three fio reads for 10 seconds. The first job is prio0, the second
prio4 and the third prio7.

(prio 0) read : io=978MiB, bw=100MiB/s, iops=25,023, runt= 10005msec
(prio 4) read : io=953MiB, bw=99,950KiB/s, iops=24,401, runt= 10003msec
(prio 7) read : io=74,228KiB, bw=7,594KiB/s, iops=1,854, runt= 10009msec

Note there is almost no difference between the prio 0 and prio 4 jobs, and
the prio7 job has been penalized heavily (it gets less than 10% of the BW
of the prio 4 job).

> Non-NCQ SSDs, instead, will still have the idle window enabled, so it
> is not an issue for them.

Agree.

>
> >
> > On SSDs, will it make more sense to provide fairness in terms of number of
> > IO or size of IO and not in terms of time slices.
> Not on all SSDs. There are still ones that have a non-negligible
> penalty on non-sequential access pattern (hopefully the ones without
> NCQ, but if we find otherwise, then we will have to benchmark access
> time in I/O scheduler to select the best policy). For those, time
> based may still be needed.

Ok.

So on better SSDs out there with NCQ, we probably don't support the notion
of ioprio? Or am I missing something?

Thanks
Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
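The bandwidth ratios quoted in this message can be checked with a few lines of arithmetic. This is a sketch only: `toy_pct` is a hypothetical helper, and the prio 0 figure is taken as 102400 KiB/s on the assumption that fio's rounded "100MiB/s" means 100 * 1024 KiB/s.

```c
/*
 * Hypothetical helper: integer percentage of `part` relative to `whole`,
 * truncated toward zero. Returns 0 when `whole` is 0.
 */
static unsigned long toy_pct(unsigned long part, unsigned long whole)
{
    if (whole == 0)
        return 0;
    return part * 100UL / whole;
}
```

Under that assumption, `toy_pct(7594, 99950)` yields 7, matching the "less than 10%" observation for prio 7 against prio 4, and `toy_pct(99950, 102400)` yields 97, matching the "almost no difference" observation for prio 4 against prio 0.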
[parent not found: <20091002222756.GG4494-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091002222756.GG4494-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-10-03 12:43 ` Corrado Zoccolo 0 siblings, 0 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-10-03 12:43 UTC (permalink / raw) To: Vivek Goyal Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Sat, Oct 3, 2009 at 12:27 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Sat, Oct 03, 2009 at 12:14:28AM +0200, Corrado Zoccolo wrote: >> In fact I think that the 'rotating' flag name is misleading. >> All the checks we are doing are actually checking if the device truly >> supports multiple parallel operations, and this feature is shared by >> hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single >> NCQ-enabled SATA disk. >> > > While we are at it, what happens to notion of priority of tasks on SSDs? This is not changed by proposed patch w.r.t. current CFQ. > Without idling there is not continuous time slice and there is no > fairness. So ioprio is out of the window for SSDs? I haven't NCQ enabled SSDs here, so I can't test it, but it seems to me that the way in which queues are sorted in the rr tree may still provide some sort of fairness and service differentiation for priorities, in terms of number of IOs. Non-NCQ SSDs, instead, will still have the idle window enabled, so it is not an issue for them. 
>
> On SSDs, will it make more sense to provide fairness in terms of number or
> IO or size of IO and not in terms of time slices.

Not on all SSDs. There are still ones that have a non-negligible
penalty on non-sequential access pattern (hopefully the ones without
NCQ, but if we find otherwise, then we will have to benchmark access
time in I/O scheduler to select the best policy). For those, time
based may still be needed.

Thanks,
Corrado

>
> Thanks
> Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091002195815.GE4494-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091002195815.GE4494-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-10-02 22:14 ` Corrado Zoccolo 0 siblings, 0 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-10-02 22:14 UTC (permalink / raw) To: Vivek Goyal Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Valdis.Kletnieks-PjAqaU27lzQ, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Fri, Oct 2, 2009 at 9:58 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Fri, Oct 02, 2009 at 12:50:17PM -0400, Valdis.Kletnieks@vt.edu wrote: >> On Fri, 02 Oct 2009 11:40:20 EDT, Vivek Goyal said: >> >> Umm... I got petabytes of hardware RAID across the hall that very definitely >> *is* rotating. Did you mean "SSD and disk systems with big honking caches >> that cover up the rotation"? Because "RAID" and "big honking caches" are >> not *quite* the same thing, and I can just see that corner case coming out >> to bite somebody on the ass... >> > > I guess both. The systems which have big caches and cover up for rotation, > we probably need not idle for seeky process. An in case of big hardware > RAID, having multiple rotating disks, instead of idling and keeping rest > of the disks free, we probably are better off dispatching requests from > next queue (hoping it is going to a different disk altogether). In fact I think that the 'rotating' flag name is misleading. 
All the checks we are doing are actually checking if the device truly supports multiple parallel operations, and this feature is shared by hardware raids and NCQ enabled SSDs, but not by cheap SSDs or single NCQ-enabled SATA disk. If we really wanted a "seek is cheap" flag, we could measure seek time in the io-scheduler itself, but in the current code base we don't have it used in this meaning anywhere. Thanks, Corrado > > Thanks > Vivek > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- The self-confidence of a warrior is not the self-confidence of the average man. The average man seeks certainty in the eyes of the onlooker and calls that self-confidence. The warrior seeks impeccability in his own eyes and calls that humbleness. Tales of Power - C. Castaneda ^ permalink raw reply [flat|nested] 349+ messages in thread
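A rough illustration of the "measure seek time in the io-scheduler itself" idea from the message above. This is only a sketch under stated assumptions, not code from CFQ or from any posted patch: the struct, the function names, the 7/8 smoothing factor, the 8-sample minimum and the "at most 2x the sequential cost" threshold are all hypothetical choices made for the example.

```c
#include <assert.h>

/* Hypothetical per-device bookkeeping; not a real kernel structure. */
struct seek_cost {
    unsigned long seq_us;    /* smoothed service time, sequential requests */
    unsigned long seeky_us;  /* smoothed service time, requests needing a seek */
    unsigned seq_n;          /* number of sequential samples seen */
    unsigned seeky_n;        /* number of seeky samples seen */
};

/* Exponentially weighted mean; the first sample seeds the mean. */
static unsigned long ewma(unsigned long mean, unsigned long sample, unsigned n)
{
    return n == 0 ? sample : (7 * mean + sample) / 8;
}

/* Record the measured service time of one completed request. */
static void seek_cost_add(struct seek_cost *sc, int seeky,
                          unsigned long service_us)
{
    assert(sc);
    if (seeky) {
        sc->seeky_us = ewma(sc->seeky_us, service_us, sc->seeky_n);
        sc->seeky_n++;
    } else {
        sc->seq_us = ewma(sc->seq_us, service_us, sc->seq_n);
        sc->seq_n++;
    }
}

/*
 * Assumed policy: call seeks "cheap" once enough samples exist and a
 * seeky request costs at most twice a sequential one.
 */
static int seek_is_cheap(const struct seek_cost *sc)
{
    enum { MIN_SAMPLES = 8 };

    assert(sc);
    if (sc->seq_n < MIN_SAMPLES || sc->seeky_n < MIN_SAMPLES)
        return 0;
    return sc->seeky_us <= 2 * sc->seq_us;
}
```

With made-up timings, a device answering seeky requests in about 120 usec against 100 usec for sequential ones would be flagged as "seek is cheap", while a disk needing about 8000 usec per seek would not. A real implementation would have to decide where to take the timestamps and how to classify a request as seeky, which this sketch leaves open.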
[parent not found: <1254497520.10392.11.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>]
* Re: IO scheduler based IO controller V10
       [not found] ` <1254497520.10392.11.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-02 15:40 ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-02 15:40 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Corrado Zoccolo,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe,
	agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA,
	fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar,
	riel-H+wXaHxf7aLQT0dZR+AlfA,
	fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Oct 02, 2009 at 05:32:00PM +0200, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 17:27 +0200, Corrado Zoccolo wrote:
> > On Fri, Oct 2, 2009 at 2:49 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > On Fri, Oct 02, 2009 at 12:55:25PM +0200, Corrado Zoccolo wrote:
> > >
> > > Actually I am not touching this code. Looking at the V10, I have not
> > > changed anything here in idling code.
> >
> > I based my analysis on the original patch:
> > http://lkml.indiana.edu/hypermail/linux/kernel/0907.1/01793.html
> >
> > Mike, can you confirm which version of the fairness patch did you use
> > in your tests?
>
> That would be this one-liner.
>

Ok. Thanks. Sorry, I got confused and thought that you were using the
"io controller patches" with fairness=1.

In that case, Corrado's suggestion of refining it further and disabling
idling for seeky processes only on non-rotational media (SSD and hardware
RAID) makes sense to me.

Thanks
Vivek

> o CFQ provides fair access to disk in terms of disk time used to processes.
>   Fairness is provided for the applications which have their think time
>   within the slice_idle (8ms default) limit.
>
> o CFQ currently disables idling for seeky processes. So even if a process
>   has think time within slice_idle limits, it will still not get a fair
>   share of disk. Disabling idling for a seeky process seems good from a
>   throughput perspective but not necessarily from a fairness perspective.
>
> o Do not disable idling based on seek pattern of process if a user has set
>   /sys/block/<disk>/queue/iosched/fairness = 1.
>
> Signed-off-by: Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> ---
>  block/cfq-iosched.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -1953,7 +1953,7 @@ cfq_update_idle_window(struct cfq_data *
>  	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
>  	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> -	    (cfqd->hw_tag && CIC_SEEKY(cic)))
> +	    (!cfqd->cfq_fairness && cfqd->hw_tag && CIC_SEEKY(cic)))
>  		enable_idle = 0;
>  	else if (sample_valid(cic->ttime_samples)) {
>  		if (cic->ttime_mean > cfqd->cfq_slice_idle)
>

^ permalink raw reply	[flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 @ 2009-10-02 10:55 Corrado Zoccolo 0 siblings, 0 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-10-02 10:55 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Hi Jens, On Fri, Oct 2, 2009 at 11:28 AM, Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > On Fri, Oct 02 2009, Ingo Molnar wrote: >> >> * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: >> > > It's really not that simple, if we go and do easy latency bits, then > throughput drops 30% or more. You can't say it's black and white latency > vs throughput issue, that's just not how the real world works. The > server folks would be most unpleased. Could we be more selective when the latency optimization is introduced? The code that is currently touched by Vivek's patch is: if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || (cfqd->hw_tag && CIC_SEEKY(cic))) enable_idle = 0; basically, when fairness=1, it becomes just: if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle) enable_idle = 0; Note that, even if we enable idling here, the cfq_arm_slice_timer will use a different idle window for seeky (2ms) than for normal I/O. I think that the 2ms idle window is good for a single rotational SATA disk scenario, even if it supports NCQ. 
Realistic access times for those disks are still around 8ms (though this
grows with seek length), and waiting 2ms to see if we get a nearby
request may pay off, not only in latency and fairness, but also in
throughput.

What we don't want to do is to enable idling for NCQ enabled SSDs
(and this is already taken care of in cfq_arm_slice_timer) or for
hardware RAIDs.
If we agree that hardware RAIDs should be marked as non-rotational, then
that code could become:

	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
	    (blk_queue_nonrot(cfqd->queue) && cfqd->hw_tag && CIC_SEEKY(cic)))
		enable_idle = 0;
	else if (sample_valid(cic->ttime_samples)) {
		unsigned idle_time = CIC_SEEKY(cic) ? CFQ_MIN_TT : cfqd->cfq_slice_idle;
		if (cic->ttime_mean > idle_time)
			enable_idle = 0;
		else
			enable_idle = 1;
	}

Thanks,
Corrado

>
> --
> Jens Axboe
>

--
__________________________________________________________________________

dott. Corrado Zoccolo                 mailto:czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

^ permalink raw reply	[flat|nested] 349+ messages in thread
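To make the three variants discussed in this sub-thread easier to compare, here is a small user-space model of the idle-window decision. It is a sketch under stated assumptions, not the kernel function: struct idle_inputs and its field names are hypothetical stand-ins for the cfqd/cic state, times are plain milliseconds rather than jiffies, and MIN_TT_MS = 2 stands in for CFQ_MIN_TT based on the "2ms" figure quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the state read by cfq_update_idle_window(). */
struct idle_inputs {
    bool has_tasks;       /* atomic_read(&cic->ioc->nr_tasks) != 0 */
    unsigned slice_idle;  /* cfqd->cfq_slice_idle in ms, 0 disables idling */
    bool hw_tag;          /* device appears to queue commands (e.g. NCQ) */
    bool nonrot;          /* blk_queue_nonrot(): SSD, or RAID if so marked */
    bool seeky;           /* CIC_SEEKY(cic) */
    bool samples_valid;   /* sample_valid(cic->ttime_samples) */
    unsigned ttime_mean;  /* mean think time in ms */
    bool old_idle;        /* previous idle-window state */
};

enum { MIN_TT_MS = 2 };   /* assumed value standing in for CFQ_MIN_TT */

enum policy {
    POLICY_VANILLA,       /* 2.6.31: no idling for seeky queues on hw_tag */
    POLICY_FAIRNESS,      /* one-liner patch with fairness=1 */
    POLICY_NONROT         /* variant proposed in the message above */
};

static bool enable_idle(enum policy p, const struct idle_inputs *in)
{
    bool skip_seeky;
    unsigned idle_time;

    assert(in);
    idle_time = in->slice_idle;

    switch (p) {
    case POLICY_VANILLA:
        skip_seeky = in->hw_tag && in->seeky;
        break;
    case POLICY_FAIRNESS:
        skip_seeky = false;
        break;
    default: /* POLICY_NONROT */
        skip_seeky = in->nonrot && in->hw_tag && in->seeky;
        if (in->seeky)
            idle_time = MIN_TT_MS;  /* shorter think-time bound when seeky */
        break;
    }

    if (!in->has_tasks || !in->slice_idle || skip_seeky)
        return false;
    if (in->samples_valid)
        return in->ttime_mean <= idle_time;
    return in->old_idle;            /* no samples yet: keep previous state */
}
```

For a seeky process with a 1 ms mean think time on a rotational NCQ disk, the model disables idling under POLICY_VANILLA and keeps it under the other two. Setting nonrot makes POLICY_NONROT disable it again, and a 5 ms think time is accepted by POLICY_FAIRNESS (bound of 8 ms) but rejected by POLICY_NONROT (bound of 2 ms).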
* IO scheduler based IO controller V10
@ 2009-09-24 19:25 Vivek Goyal
  2009-09-24 21:33 ` Andrew Morton
                   ` (3 more replies)
  0 siblings, 4 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-24 19:25 UTC (permalink / raw)
  To: linux-kernel, jens.axboe
  Cc: containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng,
	jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal, akpm,
	peterz, jmarchan, torvalds, mingo, riel

Hi All,

Here is the V10 of the IO controller patches generated on top of 2.6.31.

For ease of patching, a consolidated patch is available here.

http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v10.patch

Changes from V9
===============
- Brought back the mechanism of idle trees (cache of recently served io
  queues). BFQ had originally implemented it and I had got rid of it. Later
  I realized that it helps provide fairness when io queues and io groups
  are running at the same level. Hence brought the mechanism back.

  This cache helps in determining whether a task getting back into the tree
  is a streaming reader who just consumed its full slice length, a new
  process (if not in cache), or a random reader who just got a small slice
  length and now got backlogged again.

- Implemented "wait busy" for sequential reader queues. So we wait for one
  extra idle period for these queues to become busy so that the group does
  not lose fairness. This works even if group_idle=0.

- Fixed an issue where readers don't preempt writers within a group when
  readers get backlogged. (implemented late preemption).

- Fixed the issue reported by Gui where Anticipatory was not expiring the
  queue.

- Did more modification to AS so that it lets the common layer know that it
  is anticipating the next request, and the common fair queuing layer does
  not try to do excessive queue expirations.

- Started charging the queue only for allocated slice length (if fairness
  is not set) if it consumed more than allocated slice.
  Otherwise that queue can miss a dispatch round, doubling the max
  latencies. This idea is also borrowed from BFQ.

- Allowed preemption where a reader can preempt a writer running in a
  sibling group, or a metadata reader can preempt a non-metadata reader
  in a sibling group.

- Fixed freed_request() issue pointed out by Nauman.

What problem are we trying to solve
===================================
Provide a group IO scheduling feature in Linux along the lines of other
resource controllers like cpu.

IOW, provide a facility so that a user can group applications using cgroups
and control the amount of disk time/bandwidth received by a group based on
its weight.

How to solve the problem
=========================

Different people have solved the issue differently. So far it looks like we
have the following two core requirements when it comes to fairness at group
level.

- Control bandwidth seen by groups.
- Control on latencies when a request gets backlogged in group.

At least there are now three patchsets available (including this one).

IO throttling
-------------
This is a bandwidth controller which keeps track of the IO rate of a group
and throttles the processes in the group if it exceeds the user-specified
limit.

dm-ioband
---------
This is a proportional bandwidth controller implemented as a device mapper
driver and provides fair access in terms of amount of IO done (not in terms
of disk time as CFQ does).

So one will set up one or more dm-ioband devices on top of a
physical/logical block device, configure the ioband device and pass
information like grouping etc. Now this device will keep track of bios
flowing through it and control the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here we have viewed the problem of the IO controller as a hierarchical group
scheduling (along the lines of CFS group scheduling) issue.
Currently one can view linux IO schedulers as flat, where there is one root
group and all the IO belongs to that group.

This patchset basically modifies IO schedulers to also support hierarchical
group scheduling. CFQ already provides fairness among different processes. I
have extended it to support group IO scheduling. Also took some of the code
out of CFQ and put it in a common layer so that the same group scheduling
code can be used by noop, deadline and AS to support group scheduling.

Pros/Cons
=========
There are pros and cons to each of the approaches. Following are some of the
thoughts.

Max bandwidth vs proportional bandwidth
---------------------------------------
IO throttling is a max bandwidth controller and not a proportional one.
Additionally it provides fairness in terms of amount of IO done (and not in
terms of disk time as CFQ does).

Personally, I think that a proportional weight controller is useful to more
people than just a max bandwidth controller. In addition, the IO scheduler
based controller can also be enhanced to do max bandwidth control. So it can
satisfy a wider set of requirements.

Fairness in terms of disk time vs size of IO
---------------------------------------------
A higher level controller will most likely be limited to providing fairness
in terms of size/number of IO done and will find it hard to provide fairness
in terms of disk time used (as CFQ provides between various prio levels).
This is because only the IO scheduler knows how much disk time a queue has
used, and information about queues and disk time used is not exported to
higher layers.

So a seeky application will still run away with a lot of disk time and bring
down the overall throughput of the disk.

Currently dm-ioband provides fairness in terms of number/size of IO.
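The difference between the two fairness notions can be put into numbers with a little arithmetic. The sketch below is purely illustrative: the helper names are hypothetical, and the per-request costs used in the usage note (100 usec for a sequential request, 8000 usec for a seeky one) are assumed round figures, not measurements from this posting.

```c
#include <assert.h>

/*
 * Share of disk time (in 1/1000ths) consumed by an application when two
 * applications are given the same NUMBER of requests: each pair of
 * requests costs own_us + other_us of disk time.
 */
static unsigned long time_share_permille(unsigned long own_us,
                                         unsigned long other_us)
{
    assert(own_us + other_us != 0);
    return own_us * 1000 / (own_us + other_us);
}

/* Requests per second each application completes under equal-count fairness. */
static unsigned long iops_count_fair(unsigned long a_us, unsigned long b_us)
{
    assert(a_us + b_us != 0);
    return 1000000 / (a_us + b_us);
}

/*
 * Requests per second an application completes when it is given
 * share_permille/1000 of the disk TIME and each request costs cost_us.
 */
static unsigned long iops_time_fair(unsigned long cost_us,
                                    unsigned long share_permille)
{
    assert(cost_us != 0);
    return share_permille * 1000 / cost_us;
}
```

With the assumed costs, equal-count fairness hands the seeky application about 987/1000 of the disk time and limits both applications to roughly 123 requests per second. Equal-time fairness (500/1000 each) lets the sequential application complete about 5000 requests per second while the seeky one still gets about 62, which is the "seeky application runs away with disk time" effect described above.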
Latencies and isolation between groups
--------------------------------------
A higher level controller generally implements a bandwidth throttling
solution: if a group exceeds either the max bandwidth or the proportional
share, that group is throttled.

This kind of approach will probably not help in controlling latencies, as
they will depend on the underlying IO scheduler. Consider the following
scenario.

Assume there are two groups. One group is running multiple sequential
readers and the other group has a random reader. The sequential readers
will get a nice 100ms slice each and then the random reader from group2
will get to dispatch its request. So the latency of this random reader will
depend on how many sequential readers are running in the other group, and
that is weak isolation between groups.

When we control things at the IO scheduler level, we assign one time slice
to one group and then pick the next entity to run. So effectively after one
time slice (max 180ms, if a prio 0 sequential reader is running), the
random reader in the other group will get to run. Hence we achieve better
isolation between groups, as the response time of a process in a different
group is generally not dependent on the number of processes running in the
competing group.

So a higher level solution is most likely limited to only shaping bandwidth
without any control on latencies.

Stacking group scheduler on top of CFQ can lead to issues
---------------------------------------------------------
IO throttling and dm-ioband are both second level controllers. That is,
these controllers are implemented in higher layers than io schedulers. So
they control the IO at a higher layer based on group policies, and later IO
schedulers take care of dispatching these bios to disk.

Implementing a second level controller has the advantage of being able to
provide bandwidth control even on logical block devices in the IO stack
which don't have any IO schedulers attached to them.
But they can also interfere with the IO scheduling policy of the underlying
IO scheduler and change the effective behavior. Following are some of the
issues which I think should be visible in a second level controller in one
form or other.

  Prio within group
  -----------------
  A second level controller can potentially interfere with the behavior of
  different prio processes within a group. bios are buffered at the higher
  layer in a single queue, and release of bios is FIFO and not proportionate
  to the ioprio of the process. This can result in a particular prio level
  not getting its fair share.

  Buffering at the higher layer can delay read requests for more than the
  slice idle period of CFQ (default 8 ms). That means it is possible that we
  are waiting for a request from the queue but it is buffered at the higher
  layer, and then the idle timer will fire. The queue will lose its share,
  and at the same time overall throughput will be impacted as we lost those
  8 ms.

  Read Vs Write
  -------------
  Writes can overwhelm readers, hence second level controller FIFO release
  will run into issues here. If there is a single queue maintained then
  reads will suffer large latencies. If there are separate queues for reads
  and writes then it will be hard to decide in what ratio to dispatch reads
  and writes, as it is the IO scheduler's decision when and how much
  read/write to dispatch. This is another place where a higher level
  controller will not be in sync with the lower level io scheduler and can
  change the effective policies of the underlying io scheduler.

  CFQ IO context Issues
  ---------------------
  Buffering at the higher layer means submission of bios later with the help
  of a worker thread. This changes the io context information at the CFQ
  layer, which assigns the request to the submitting thread. The change of
  io context info again leads to issues of idle timer expiry, a process not
  getting its fair share, and reduced throughput.
  Throughput with noop, deadline and AS
  ---------------------------------------------
  I think a higher level controller will result in reduced overall
  throughput (as compared to the io scheduler based io controller) and more
  seeks with noop, deadline and AS.

  The reason is that IO within a group is likely to be related and
  relatively close on disk as compared to IO across groups. For example, the
  thread pool of kvm-qemu doing IO for a virtual machine. In the case of
  higher level control, IO from various groups will go into a single queue
  at the lower level controller, and it might happen that IO is now
  interleaved (G1, G2, G1, G3, G4....), causing more seeks and reduced
  throughput. (Agreed that merging will help up to some extent but
  still....).

  Instead, in the case of a lower level controller, the IO scheduler
  maintains one queue per group, hence there is no interleaving of IO
  between groups. And if IO is related within a group, then we should get a
  reduced number/amount of seeks and higher throughput.

  Latency can be a concern but that can be controlled by reducing the time
  slice length of the queue.

Fairness at logical device level vs at physical device level
------------------------------------------------------------

The IO scheduler based controller has the limitation that it works only with
the bottom most devices in the IO stack, where the IO scheduler is attached.

For example, assume a user has created a logical device lv0 using three
underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
in two groups doing IO on lv0. Also assume that the weights of the groups
are in the ratio of 2:1, so T1 should get double the BW of T2 on the lv0
device.

			     T1    T2
			       \   /
			        lv0
			      /  |  \
			   sda  sdb  sdc

Now resource control will take place only on devices sda, sdb and sdc and
not at lv0 level. So if IO from the two tasks is relatively uniformly
distributed across the disks, then T1 and T2 will see the throughput ratio
in proportion to the weights specified.
But if IO from T1 and T2 is going to different disks and there is no
contention, then at the higher level they both will see the same BW.

Here a second level controller can produce better fairness numbers at the
logical device, but most likely at reduced overall throughput of the system,
because it will try to control IO even if there is no contention at the
physical devices, possibly leaving disks unused in the system.

Hence, the question is how important it is to control bandwidth at higher
level logical devices also. The actual contention for resources is at the
leaf block device, so it probably makes sense to do any kind of control
there and not at the intermediate devices. Secondly, it probably also means
better use of available resources.

Limited Fairness
----------------
Currently CFQ idles on a sequential reader queue to make sure it gets its
fair share. A second level controller will find it tricky to anticipate.
Either it will not have any anticipation logic, and in that case it will not
provide fairness to single readers in a group (as dm-ioband does), or if it
starts anticipating then we should run into these strange situations where
the second level controller is anticipating on one queue/group and the
underlying IO scheduler might be anticipating on something else.

Need of device mapper tools
---------------------------
A device mapper based solution will require creation of an ioband device on
each physical/logical device one wants to control. So it requires usage of
device mapper tools even for the people who are not using device mapper. At
the same time, creation of an ioband device on each partition in the system
to control the IO can be cumbersome and overwhelming if the system has lots
of disks and partitions.

IMHO, the IO scheduler based IO controller is a reasonable approach to solve
the problem of group bandwidth control, and can do hierarchical IO
scheduling more tightly and efficiently.
But I am all ears to alternative approaches and suggestions on how things
can be done better, and will be glad to implement them.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Testing
=======

Environment
===========
A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem. I am mostly
running fio jobs which have been limited to 30 second runs, and then
monitored the throughput and latency.

Test1: Random Reader Vs Random Writers
======================================
Launched a random reader and then an increasing number of random writers to
see the effect on random reader BW and max latencies.

[fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]

[Vanilla CFQ, No groups]
<--------------random writers-------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec
2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec
4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec
8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec
16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec
32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec

Created two cgroups group1 and group2 of weights 500 each. Launched an
increasing number of random writers in group1 and one random reader in
group2 using fio.
[IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
<--------------random writers(group1)-------------> <-random reader(group2)->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec
2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec
4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec
8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec
16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec
32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec

Also ran the same test with IO controller CFQ in flat mode to see if there
are any major deviations from Vanilla CFQ. Does not look like any.

[IO controller CFQ; No groups ]
<--------------random writers-------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec
2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec
4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec
8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec
16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec
32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec

Notes:
- With vanilla CFQ, random writers can overwhelm a random reader, bringing
  down its throughput and bumping up latencies significantly.

- With IO controller, one can provide isolation to the random reader group
  and maintain a consistent view of bandwidth and latencies.

Test2: Random Reader Vs Sequential Reader
========================================
Launched a random reader and then an increasing number of sequential readers
to see the effect on BW and latencies of the random reader.
[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]

[ Vanilla CFQ, No groups ]
<---------------seq readers---------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec
2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec
4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec
8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec
16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec

Created two cgroups group1 and group2 of weights 500 each. Launched
increasing number of sequential readers in group1 and one random reader
in group2 using fio.

[IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
<---------------group1--------------------------->  <------group2--------->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec
2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec
4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec
8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec
16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec

Also ran the same test with IO controller CFQ in flat mode to see if there
are any major deviations from Vanilla CFQ. Does not look like any.
[IO controller CFQ; No groups ]
<---------------seq readers---------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec
2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec
4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec
8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec
16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec

Notes:
- The BW and latencies of the random reader in group 2 seem to be stable and
  bounded and do not get impacted much as the number of sequential readers
  increases in group1. Hence providing good isolation.

- Throughput of sequential readers comes down and latencies go up as half
  of disk bandwidth (in terms of time) has been reserved for the random
  reader group.

Test3: Sequential Reader Vs Sequential Reader
============================================
Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
Launched an increasing number of sequential readers in group1 and one
sequential reader in group2 using fio, and monitored how bandwidth is being
distributed between the two groups.

First 5 columns give stats about jobs in group1 and last two columns give
stats about the job in group2.

<---------------group1--------------------------->  <------group2--------->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec
2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec
4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec
8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec
16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec

Note: group2 is getting double the bandwidth of group1 even in the face
of an increasing number of readers in group1.
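The 2:1 claim for Test3 can be checked directly from the aggregate bandwidth columns. The snippet below is only a worked-arithmetic sketch: the numbers are copied from the Test3 table, while the struct, the helper names and the tolerance parameter are hypothetical. Note also that the controller allocates disk time, so bandwidth is just a proxy that is reasonable here because both groups run sequential readers.

```c
#include <assert.h>
#include <stddef.h>

/* One Test3 row: readers in group1, aggregate KiB/s of group1 and group2. */
struct sample {
    unsigned nr;
    unsigned long g1_kib;
    unsigned long g2_kib;
};

static const struct sample test3[] = {
    {  1, 8970, 20681 },
    {  2, 9984, 19682 },
    {  4, 9280, 19235 },
    {  8, 9926, 19501 },
    { 16, 9395, 19367 },
};

/* Ratio num/den expressed in percent, truncated. */
static unsigned long ratio_pct(unsigned long num, unsigned long den)
{
    assert(den != 0);
    return num * 100 / den;
}

/* Returns 1 if every row's group2/group1 ratio is within tol_pct of expected_pct. */
static int ratios_within(const struct sample *s, size_t n,
                         unsigned long expected_pct, unsigned long tol_pct)
{
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned long r = ratio_pct(s[i].g2_kib, s[i].g1_kib);
        unsigned long diff = r > expected_pct ? r - expected_pct
                                              : expected_pct - r;
        if (diff > tol_pct)
            return 0;
    }
    return 1;
}
```

The per-row ratios come out at 230%, 197%, 207%, 196% and 206%, so the measured split stays close to the 200% expected from weights of 1000 and 500, with the single-reader row being the largest outlier.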
Test4 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on the host in two
partitions and gave one partition to each virtual machine. Put the two
virtual machines in two different cgroups of weight 1000 and 500
respectively. The virtual machines created ext3 file systems on the
partitions exported from the host and did buffered writes. The host sees
the writes as synchronous, and the virtual machine with the higher weight
gets double the disk time of the virtual machine of lower weight. Used the
deadline scheduler in this test case.

Some more details about configuration are in the documentation patch.

Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (page cache) as well as possibly in the
file system layer (btrfs, xfs etc), and are dispatched to lower layers not
necessarily in a proportional manner.

For example, consider two dd threads reading /dev/zero as input file and
doing writes of huge files. Very soon we will cross vm_dirty_ratio and a dd
thread will be forced to write out some pages to disk before more pages can
be dirtied. But not necessarily dirty pages of the same thread are picked.
It can very well pick the inode of the lesser priority dd thread and do some
writeout. So effectively the higher weight dd is doing writeouts of the
lower weight dd's pages and we don't see service differentiation.

IOW, the core problem with buffered write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep the
queue continuously backlogged. In my testing, there are many .2 to .8 second
intervals where the higher weight queue is empty, and in that duration the
lower weight queue gets lots of work done, giving the impression that there
was no service differentiation.
In summary, from the IO controller point of view async writes support is
there. Because the page cache has not been designed in such a manner that a
higher prio/weight writer can do more write out as compared to a lower
prio/weight writer, getting service differentiation is hard; it is visible
in some cases and not visible in others.

Vanilla CFQ Vs IO Controller CFQ
================================
We have not fundamentally changed CFQ, instead enhanced it to also support
hierarchical io scheduling. In the process invariably there are small
changes here and there as new scenarios come up. Running some tests here and
comparing both the CFQ's to see if there is any major deviation in behavior.

Test1: Sequential Readers
=========================
[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec
2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec
4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec
8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec
16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec

IO scheduler: IO controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec
2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec
4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec
8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec
16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec

Test2: Sequential Writers
=========================
[fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec
2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec
4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec
8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec
16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec
2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec
4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec
8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec
16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec

Test3: Random Readers
=========================
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   484KiB/s    484KiB/s    484KiB/s    22596 usec
2   229KiB/s    196KiB/s    425KiB/s    51111 usec
4   119KiB/s    73KiB/s     405KiB/s    2344 msec
8   93KiB/s     23KiB/s     399KiB/s    2246 msec
16  38KiB/s     8KiB/s      328KiB/s    3965 msec

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   483KiB/s    483KiB/s    483KiB/s    29391 usec
2   229KiB/s    196KiB/s    426KiB/s    51625 usec
4   132KiB/s    88KiB/s     417KiB/s    2313 msec
8   79KiB/s     18KiB/s     389KiB/s    2298 msec
16  43KiB/s     9KiB/s      327KiB/s    3905 msec

Test4: Random Writers
=====================
[fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec
2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec
4   2657KiB/s   265KiB/s    6025KiB/s   216K usec
8   951KiB/s    122KiB/s    3386KiB/s   1148K usec
16  66KiB/s     22KiB/s     829KiB/s    1308 msec

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec
2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec
4   3113KiB/s   334KiB/s    5782KiB/s   200K usec
8   1146KiB/s   95KiB/s     3832KiB/s   593K usec
16  71KiB/s     29KiB/s     814KiB/s    1457 msec

Notes:
 - Does not look like anything has changed significantly.

Previous versions of the patches were posted here.
------------------------------------------------ (V1) http://lkml.org/lkml/2009/3/11/486 (V2) http://lkml.org/lkml/2009/5/5/275 (V3) http://lkml.org/lkml/2009/5/26/472 (V4) http://lkml.org/lkml/2009/6/8/580 (V5) http://lkml.org/lkml/2009/6/19/279 (V6) http://lkml.org/lkml/2009/7/2/369 (V7) http://lkml.org/lkml/2009/7/24/253 (V8) http://lkml.org/lkml/2009/8/16/204 (V9) http://lkml.org/lkml/2009/8/28/327 Thanks Vivek ^ permalink raw reply [flat|nested] 349+ messages in thread
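For readers who want to put a number on the Vanilla CFQ vs IO Controller CFQ
comparison in the message above, the aggregate-bandwidth columns of its four
tables can be compared row by row. The sketch below copies those figures as
reported and computes the relative change for each row; the names AGG_BW,
relative_change and largest_change are invented for this illustration and
the script draws no conclusion about what counts as significant.

```python
# Aggregate bandwidth in KiB/s for numjobs = 1, 2, 4, 8, 16, copied from the
# "Vanilla CFQ" and "IO Controller CFQ" tables (vanilla first, controller second).
NUMJOBS = [1, 2, 4, 8, 16]
AGG_BW = {
    "sequential readers": ([35499, 30690, 29411, 29312, 28921],
                           [34494, 30616, 29631, 29162, 28945]),
    "sequential writers": ([22669, 22179, 23174, 22427, 20601],
                           [22911, 23383, 23207, 22566, 21349]),
    "random readers": ([484, 425, 405, 399, 328],
                       [483, 426, 417, 389, 327]),
    "random writers": ([14641, 9245, 6025, 3386, 829],
                       [14454, 8699, 5782, 3832, 814]),
}


def relative_change(vanilla, controller):
    """Relative change of the IO controller figure against vanilla CFQ."""
    return (controller - vanilla) / vanilla


def largest_change(table=AGG_BW):
    """Return (workload, numjobs, change) for the row with the largest |change|."""
    rows = [(name, nr, relative_change(v, c))
            for name, (vanilla, controller) in table.items()
            for nr, v, c in zip(NUMJOBS, vanilla, controller)]
    return max(rows, key=lambda row: abs(row[2]))


if __name__ == "__main__":
    for name, (vanilla, controller) in AGG_BW.items():
        changes = [relative_change(v, c) for v, c in zip(vanilla, controller)]
        print(name, " ".join(f"{ch:+.1%}" for ch in changes))
    print("largest:", largest_change())
```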
* Re: IO scheduler based IO controller V10
  2009-09-24 19:25 Vivek Goyal
@ 2009-09-24 21:33 ` Andrew Morton
  2009-09-25  2:20 ` Ulrich Lukas
  ` (2 subsequent siblings)
  3 siblings, 0 replies; 349+ messages in thread
From: Andrew Morton @ 2009-09-24 21:33 UTC
To: Vivek Goyal
Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah, lizf,
    mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida, taka,
    guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, vgoyal,
    peterz, jmarchan, torvalds, mingo, riel

On Thu, 24 Sep 2009 15:25:04 -0400
Vivek Goyal <vgoyal@redhat.com> wrote:

>
> Hi All,
>
> Here is the V10 of the IO controller patches generated on top of 2.6.31.
>

Thanks for the writeup.  It really helps and is most worthwhile for a
project of this importance, size and complexity.

>
> What problem are we trying to solve
> ===================================
> Provide group IO scheduling feature in Linux along the lines of other resource
> controllers like cpu.
>
> IOW, provide facility so that a user can group applications using cgroups and
> control the amount of disk time/bandwidth received by a group based on its
> weight.
>
> How to solve the problem
> =========================
>
> Different people have solved the issue differently. So far it looks
> like we seem to have following two core requirements when it comes to
> fairness at group level.
>
> - Control bandwidth seen by groups.
> - Control on latencies when a request gets backlogged in group.
>
> At least there are now three patchsets available (including this one).
>
> IO throttling
> -------------
> This is a bandwidth controller which keeps track of IO rate of a group and
> throttles the process in the group if it exceeds the user specified limit.
>
> dm-ioband
> ---------
> This is a proportional bandwidth controller implemented as device mapper
> driver and provides fair access in terms of amount of IO done (not in terms
> of disk time as CFQ does).
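The difference between fairness in amount of IO done and fairness in disk
time can be made concrete with a rough calculation (an illustration only,
not part of the patchset or of either mail). It borrows the single-job
aggregate rates reported in the Vanilla CFQ tables of the original posting
for one sequential reader (35499KiB/s) and one random reader (484KiB/s), and
assumes, as a simplification, that each group always runs at that rate
whenever it owns the disk. The helper names are hypothetical.

```python
SEQ_RATE_KIB_S = 35499.0   # one sequential reader alone (Vanilla CFQ table)
RAND_RATE_KIB_S = 484.0    # one random reader alone (Vanilla CFQ table)


def equal_bytes_split(rate_a, rate_b):
    """Fairness in amount of IO: both groups move the same number of bytes.

    Solves time_a * rate_a == time_b * rate_b with time_a + time_b == 1.
    Returns (time_a, time_b, per_group_throughput).
    """
    time_a = rate_b / (rate_a + rate_b)
    time_b = rate_a / (rate_a + rate_b)
    return time_a, time_b, time_a * rate_a


def equal_time_split(rate_a, rate_b):
    """Fairness in disk time: each group owns the disk half of the time."""
    return 0.5 * rate_a, 0.5 * rate_b


if __name__ == "__main__":
    t_seq, t_rand, per_group = equal_bytes_split(SEQ_RATE_KIB_S, RAND_RATE_KIB_S)
    print(f"equal bytes: seq gets {t_seq:.1%} of disk time, "
          f"random gets {t_rand:.1%}, each moves {per_group:.0f} KiB/s")
    seq_bw, rand_bw = equal_time_split(SEQ_RATE_KIB_S, RAND_RATE_KIB_S)
    print(f"equal time: seq moves {seq_bw:.1f} KiB/s, random moves {rand_bw:.1f} KiB/s")
```

Under these assumptions, equalising bytes hands almost all of the disk time
to the seeky group and drags the sequential group down to the random
reader's rate, whereas equalising time leaves the sequential group at half
of its stand-alone rate.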
>
> So one will setup one or more dm-ioband devices on top of physical/logical
> block device, configure the ioband device and pass information like grouping
> etc. Now this device will keep track of bios flowing through it and control
> the flow of bios based on group policies.
>
> IO scheduler based IO controller
> --------------------------------
> Here we have viewed the problem of IO controller as hierarchical group
> scheduling (along the lines of CFS group scheduling) issue. Currently one can
> view linux IO schedulers as flat where there is one root group and all the IO
> belongs to that group.
>
> This patchset basically modifies IO schedulers to also support hierarchical
> group scheduling. CFQ already provides fairness among different processes. I
> have extended it to support group IO scheduling. Also took some of the code
> out of CFQ and put in a common layer so that same group scheduling code can be
> used by noop, deadline and AS to support group scheduling.
>
> Pros/Cons
> =========
> There are pros and cons to each of the approaches. Following are some of the
> thoughts.
>
> Max bandwidth vs proportional bandwidth
> ---------------------------------------
> IO throttling is a max bandwidth controller and not a proportional one.
> Additionally it provides fairness in terms of amount of IO done (and not in
> terms of disk time as CFQ does).
>
> Personally, I think that proportional weight controller is useful to more
> people than just max bandwidth controller. In addition, IO scheduler based
> controller can also be enhanced to do max bandwidth control. So it can
> satisfy wider set of requirements.
>
> Fairness in terms of disk time vs size of IO
> ---------------------------------------------
> A higher level controller will most likely be limited to providing fairness
> in terms of size/number of IO done and will find it hard to provide fairness
> in terms of disk time used (as CFQ provides between various prio levels).
> This is because only IO scheduler knows how much disk time a queue has used
> and information about queues and disk time used is not exported to higher
> layers.
>
> So a seeky application will still run away with lot of disk time and bring
> down the overall throughput of the disk.

But that's only true if the thing is poorly implemented.

A high-level controller will need some view of the busyness of the
underlying device(s).  That could be "proportion of idle time", or
"average length of queue" or "average request latency" or some mix of
these or something else altogether.

But these things are simple to calculate, and are simple to feed back
to the higher-level controller and probably don't require any changes
to the IO scheduler at all, which is a great advantage.

And I must say that high-level throttling based upon feedback from
lower layers seems like a much better model to me than hacking away in
the IO scheduler layer.  Both from an implementation point of view and
from a "we can get it to work on things other than block devices" point
of view.

> Currently dm-ioband provides fairness in terms of number/size of IO.
>
> Latencies and isolation between groups
> --------------------------------------
> A higher level controller is generally implementing a bandwidth throttling
> solution where if a group exceeds either the max bandwidth or the proportional
> share then throttle that group.
>
> This kind of approach will probably not help in controlling latencies as it
> will depend on underlying IO scheduler. Consider following scenario.
>
> Assume there are two groups. One group is running multiple sequential readers
> and other group has a random reader. sequential readers will get a nice 100ms
> slice

Do you refer to each reader within group1, or to all readers?  It would be
daft if each reader in group1 were to get 100ms.

> each and then a random reader from group2 will get to dispatch the
> request.
> So latency of this random reader will depend on how many sequential
> readers are running in other group and that is a weak isolation between groups.

And yet that is what you appear to mean.  But surely nobody would do
that - the 100ms would be assigned to and distributed amongst all
readers in group1?

> When we control things at IO scheduler level, we assign one time slice to one
> group and then pick next entity to run. So effectively after one time slice
> (max 180ms, if prio 0 sequential reader is running), random reader in other
> group will get to run. Hence we achieve better isolation between groups as
> response time of process in a different group is generally not dependent on
> number of processes running in competing group.

I don't understand why you're comparing this implementation with such
an obviously dumb competing design!

> So a higher level solution is most likely limited to only shaping bandwidth
> without any control on latencies.
>
> Stacking group scheduler on top of CFQ can lead to issues
> ---------------------------------------------------------
> IO throttling and dm-ioband both are second level controllers. That is, these
> controllers are implemented in higher layers than io schedulers. So they
> control the IO at higher layer based on group policies and later IO
> schedulers take care of dispatching these bios to disk.
>
> Implementing a second level controller has the advantage of being able to
> provide bandwidth control even on logical block devices in the IO stack
> which don't have any IO schedulers attached to these. But they can also
> interfere with IO scheduling policy of underlying IO scheduler and change
> the effective behavior. Following are some of the issues which I think
> should be visible in second level controller in one form or other.
>
> Prio with-in group
> ------------------
> A second level controller can potentially interfere with behavior of
> different prio processes with-in a group.
> bios are buffered at higher layer
> in single queue and release of bios is FIFO and not proportionate to the
> ioprio of the process. This can result in a particular prio level not
> getting fair share.

That's an administrator error, isn't it?  Should have put the
different-priority processes into different groups.

> Buffering at higher layer can delay read requests for more than slice idle
> period of CFQ (default 8 ms). That means, it is possible that we are waiting
> for a request from the queue but it is buffered at higher layer and then idle
> timer will fire. It means that queue will lose its share and at the same time
> overall throughput will be impacted as we lost those 8 ms.

That sounds like a bug.

> Read Vs Write
> -------------
> Writes can overwhelm readers hence second level controller FIFO release
> will run into issue here. If there is a single queue maintained then reads
> will suffer large latencies. If there are separate queues for reads and writes
> then it will be hard to decide in what ratio to dispatch reads and writes as
> it is IO scheduler's decision to decide when and how much read/write to
> dispatch. This is another place where higher level controller will not be in
> sync with lower level io scheduler and can change the effective policies of
> underlying io scheduler.

The IO schedulers already take care of read-vs-write and already take
care of preventing large writes-starve-reads latencies (or at least,
they're supposed to).

> CFQ IO context Issues
> ---------------------
> Buffering at higher layer means submission of bios later with the help of
> a worker thread.

Why?

If it's a read, we just block the userspace process.

If it's a delayed write, the IO submission already happens in a kernel thread.

If it's a synchronous write, we have to block the userspace caller
anyway.

Async reads might be an issue, dunno.

> This changes the io context information at CFQ layer which
> assigns the request to submitting thread.
> Change of io context info again
> leads to issues of idle timer expiry and issue of a process not getting fair
> share and reduced throughput.

But we already have that problem with delayed writeback, which is a
huge thing - often it's the majority of IO.

> Throughput with noop, deadline and AS
> ---------------------------------------------
> I think a higher level controller will result in reduced overall throughput
> (as compared to io scheduler based io controller) and more seeks with noop,
> deadline and AS.
>
> The reason being, that it is likely that IO with-in a group will be related
> and will be relatively close as compared to IO across the groups. For example,
> thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
> control, IO from various groups will go into a single queue at lower level
> controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
> G4....) causing more seeks and reduced throughput. (Agreed that merging will
> help up to some extent but still....).
>
> Instead, in case of lower level controller, IO scheduler maintains one queue
> per group hence there is no interleaving of IO between groups. And if IO is
> related with-in group, then we should get reduced number/amount of seek and
> higher throughput.
>
> Latency can be a concern but that can be controlled by reducing the time
> slice length of the queue.

Well maybe, maybe not.  If a group is throttled, it isn't submitting
new IO.  The unthrottled group is doing the IO submitting and that IO
will have decent locality.

> Fairness at logical device level vs at physical device level
> ------------------------------------------------------------
>
> IO scheduler based controller has the limitation that it works only with the
> bottom most devices in the IO stack where IO scheduler is attached.
>
> For example, assume a user has created a logical device lv0 using three
> underlying disks sda, sdb and sdc.
> Also assume there are two tasks T1 and T2
> in two groups doing IO on lv0. Also assume that weights of groups are in the
> ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
>
>                    T1    T2
>                      \  /
>                      lv0
>                    /  |  \
>                 sda  sdb  sdc
>
>
> Now resource control will take place only on devices sda, sdb and sdc and
> not at lv0 level. So if IO from two tasks is relatively uniformly
> distributed across the disks then T1 and T2 will see the throughput ratio
> in proportion to weight specified. But if IO from T1 and T2 is going to
> different disks and there is no contention then at higher level they both
> will see same BW.
>
> Here a second level controller can produce better fairness numbers at
> logical device but most likely at reduced overall throughput of the system,
> because it will try to control IO even if there is no contention at physical
> level, possibly leaving disks unused in the system.
>
> Hence, question comes that how important it is to control bandwidth at
> higher level logical devices also. The actual contention for resources is
> at the leaf block device so it probably makes sense to do any kind of
> control there and not at the intermediate devices. Secondly probably it
> also means better use of available resources.

hm.  What will be the effects of this limitation in real-world use?

> Limited Fairness
> ----------------
> Currently CFQ idles on a sequential reader queue to make sure it gets its
> fair share. A second level controller will find it tricky to anticipate.
> Either it will not have any anticipation logic and in that case it will not
> provide fairness to single readers in a group (as dm-ioband does) or if it
> starts anticipating then we should run into these strange situations where
> second level controller is anticipating on one queue/group and underlying
> IO scheduler might be anticipating on something else.

It depends on the size of the inter-group timeslices.
If the amount of
time for which a group is unthrottled is "large" compared to the
typical anticipation times, this issue fades away.

And those timeslices _should_ be large.  Because as you mentioned
above, different groups are probably working different parts of the
disk.

> Need of device mapper tools
> ---------------------------
> A device mapper based solution will require creation of a ioband device
> on each physical/logical device one wants to control. So it requires usage
> of device mapper tools even for the people who are not using device mapper.
> At the same time creation of ioband device on each partition in the system to
> control the IO can be cumbersome and overwhelming if system has got lots of
> disks and partitions with-in.
>
>
> IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> problem of group bandwidth control, and can do hierarchical IO scheduling
> more tightly and efficiently.
>
> But I am all ears to alternative approaches and suggestions on how things
> can be done better and will be glad to implement them.
>
> TODO
> ====
> - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> - More testing to make sure there are no regressions in CFQ.
>
> Testing
> =======
>
> Environment
> ==========
> A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.

That's a bit of a toy.

Do we have testing results for more enterprisey hardware?  Big storage
arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)

> I am mostly
> running fio jobs which have been limited to 30 seconds run and then monitored
> the throughput and latency.
>
> Test1: Random Reader Vs Random Writers
> ======================================
> Launched a random reader and then increasing number of random writers to see
> the effect on random reader BW and max latencies.
>
> [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
>
> [Vanilla CFQ, No groups]
> <--------------random writers-------------------->  <------random reader-->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   5737KiB/s    5737KiB/s    5737KiB/s    164K usec    503KiB/s     159K usec
> 2   2055KiB/s    1984KiB/s    4039KiB/s    1459K usec   150KiB/s     170K usec
> 4   1238KiB/s    932KiB/s     4419KiB/s    4332K usec   153KiB/s     225K usec
> 8   1059KiB/s    929KiB/s     7901KiB/s    1260K usec   118KiB/s     377K usec
> 16  604KiB/s     483KiB/s     8519KiB/s    3081K usec   47KiB/s      756K usec
> 32  367KiB/s     222KiB/s     9643KiB/s    5940K usec   22KiB/s      923K usec
>
> Created two cgroups group1 and group2 of weights 500 each. Launched increasing
> number of random writers in group1 and one random reader in group2 using fio.
>
> [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> <--------------random writers(group1)-------------> <-random reader(group2)->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   18115KiB/s   18115KiB/s   18115KiB/s   604K usec    345KiB/s     176K usec
> 2   3752KiB/s    3676KiB/s    7427KiB/s    4367K usec   402KiB/s     187K usec
> 4   1951KiB/s    1863KiB/s    7642KiB/s    1989K usec   384KiB/s     181K usec
> 8   755KiB/s     629KiB/s     5683KiB/s    2133K usec   366KiB/s     319K usec
> 16  418KiB/s     369KiB/s     6276KiB/s    1323K usec   352KiB/s     287K usec
> 32  236KiB/s     191KiB/s     6518KiB/s    1910K usec   337KiB/s     273K usec

That's a good result.

> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
>
> [IO controller CFQ; No groups ]
> <--------------random writers-------------------->  <------random reader-->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   5696KiB/s    5696KiB/s    5696KiB/s    259K usec    500KiB/s     194K usec
> 2   2483KiB/s    2197KiB/s    4680KiB/s    887K usec    150KiB/s     159K usec
> 4   1471KiB/s    1433KiB/s    5817KiB/s    962K usec    126KiB/s     189K usec
> 8   691KiB/s     580KiB/s     5159KiB/s    2752K usec   197KiB/s     246K usec
> 16  781KiB/s     698KiB/s     11892KiB/s   943K usec    61KiB/s      529K usec
> 32  415KiB/s     324KiB/s     12461KiB/s   4614K usec   17KiB/s      737K usec
>
> Notes:
> - With vanilla CFQ, random writers can overwhelm a random reader. Bring down
>   its throughput and bump up latencies significantly.

Isn't that a CFQ shortcoming which we should address separately?  If
so, the comparisons aren't presently valid because we're comparing with
a CFQ which has known, should-be-fixed problems.

> - With IO controller, one can provide isolation to the random reader group and
>   maintain consistent view of bandwidth and latencies.
>
> Test2: Random Reader Vs Sequential Reader
> ========================================
> Launched a random reader and then increasing number of sequential readers to
> see the effect on BW and latencies of random reader.
>
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
>
> [ Vanilla CFQ, No groups ]
> <---------------seq readers---------------------->  <------random reader-->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   23318KiB/s   23318KiB/s   23318KiB/s   55940 usec   36KiB/s      247K usec
> 2   14732KiB/s   11406KiB/s   26126KiB/s   142K usec    20KiB/s      446K usec
> 4   9417KiB/s    5169KiB/s    27338KiB/s   404K usec    10KiB/s      993K usec
> 8   3360KiB/s    3041KiB/s    25850KiB/s   954K usec    60KiB/s      956K usec
> 16  1888KiB/s    1457KiB/s    26763KiB/s   1871K usec   28KiB/s      1868K usec
>
> Created two cgroups group1 and group2 of weights 500 each.
> Launched increasing
> number of sequential readers in group1 and one random reader in group2 using
> fio.
>
> [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> <---------------group1--------------------------->  <------group2--------->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   13733KiB/s   13733KiB/s   13733KiB/s   247K usec    330KiB/s     154K usec
> 2   8553KiB/s    4963KiB/s    13514KiB/s   472K usec    322KiB/s     174K usec
> 4   5045KiB/s    1367KiB/s    13134KiB/s   947K usec    318KiB/s     178K usec
> 8   1774KiB/s    1420KiB/s    13035KiB/s   1871K usec   323KiB/s     233K usec
> 16  959KiB/s     518KiB/s     12691KiB/s   3809K usec   324KiB/s     208K usec
>
> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
>
> [IO controller CFQ; No groups ]
> <---------------seq readers---------------------->  <------random reader-->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   23028KiB/s   23028KiB/s   23028KiB/s   47460 usec   36KiB/s      253K usec
> 2   14452KiB/s   11176KiB/s   25628KiB/s   145K usec    20KiB/s      447K usec
> 4   8815KiB/s    5720KiB/s    27121KiB/s   396K usec    10KiB/s      968K usec
> 8   3335KiB/s    2827KiB/s    24866KiB/s   960K usec    62KiB/s      955K usec
> 16  1784KiB/s    1311KiB/s    26537KiB/s   1883K usec   26KiB/s      1866K usec
>
> Notes:
> - The BW and latencies of random reader in group 2 seem to be stable and
>   bounded and do not get impacted much as number of sequential readers
>   increase in group1. Hence providing good isolation.
>
> - Throughput of sequential readers comes down and latencies go up as half
>   of disk bandwidth (in terms of time) has been reserved for random reader
>   group.
>
> Test3: Sequential Reader Vs Sequential Reader
> ============================================
> Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> Launched increasing number of sequential readers in group1 and one sequential
> reader in group2 using fio and monitored how bandwidth is being distributed
> between two groups.
>
> First 5 columns give stats about job in group1 and last two columns give
> stats about job in group2.
>
> <---------------group1--------------------------->  <------group2--------->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   8970KiB/s    8970KiB/s    8970KiB/s    230K usec    20681KiB/s   124K usec
> 2   6783KiB/s    3202KiB/s    9984KiB/s    546K usec    19682KiB/s   139K usec
> 4   4641KiB/s    1029KiB/s    9280KiB/s    1185K usec   19235KiB/s   172K usec
> 8   1435KiB/s    1079KiB/s    9926KiB/s    2461K usec   19501KiB/s   153K usec
> 16  764KiB/s     398KiB/s     9395KiB/s    4986K usec   19367KiB/s   172K usec
>
> Note: group2 is getting double the bandwidth of group1 even in the face
> of increasing number of readers in group1.
>
> Test4 (Isolation between two KVM virtual machines)
> ==================================================
> Created two KVM virtual machines. Partitioned a disk on host in two partitions
> and gave one partition to each virtual machine. Put both the virtual machines
> in two different cgroups of weight 1000 and 500 respectively. Virtual machines
> created ext3 file system on the partitions exported from host and did buffered
> writes. Host sees writes as synchronous and virtual machine with higher weight
> gets double the disk time of virtual machine of lower weight. Used deadline
> scheduler in this test case.
>
> Some more details about configuration are in documentation patch.
>
> Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> ===================================================================
> Fairness for async writes is tricky and biggest reason is that async writes
> are cached in higher layers (page cache) as well as possibly in file system
> layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> in proportional manner.
>
> For example, consider two dd threads reading /dev/zero as input file and doing
> writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> be forced to write out some pages to disk before more pages can be dirtied. But
> not necessarily dirty pages of same thread are picked. It can very well pick
> the inode of lesser priority dd thread and do some writeout. So effectively
> higher weight dd is doing writeouts of lower weight dd pages and we don't see
> service differentiation.
>
> IOW, the core problem with buffered write fairness is that higher weight thread
> does not throw enough IO traffic at IO controller to keep the queue
> continuously backlogged. In my testing, there are many .2 to .8 second
> intervals where higher weight queue is empty and in that duration lower weight
> queue gets lots of job done giving the impression that there was no service
> differentiation.
>
> In summary, from IO controller point of view async writes support is there.
> Because page cache has not been designed in such a manner that higher
> prio/weight writer can do more write out as compared to lower prio/weight
> writer, getting service differentiation is hard and it is visible in some
> cases and not visible in some cases.

Here's where it all falls to pieces.

For async writeback we just don't care about IO priorities.  Because
from the point of view of the userspace task, the write was async!  It
occurred at memory bandwidth speed.

It's only when the kernel's dirty memory thresholds start to get
exceeded that we start to care about prioritisation.  And at that time,
all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
consumes just as much memory as a low-ioprio dirty page.

So when balance_dirty_pages() hits, what do we want to do?
I suppose that all we can do is to block low-ioprio processes more
aggressively at the VFS layer, to reduce the rate at which they're
dirtying memory so as to give high-ioprio processes more of the disk
bandwidth.

But you've gone and implemented all of this stuff at the io-controller
level and not at the VFS level so you're, umm, screwed.

Importantly screwed!  It's a very common workload pattern, and one
which causes tremendous amounts of IO to be generated very quickly,
traditionally causing bad latency effects all over the place.  And we
have no answer to this.

> Vanilla CFQ Vs IO Controller CFQ
> ================================
> We have not fundamentally changed CFQ, instead enhanced it to also support
> hierarchical io scheduling. In the process invariably there are small changes
> here and there as new scenarios come up. Running some tests here and comparing
> both the CFQ's to see if there is any major deviation in behavior.
>
> Test1: Sequential Readers
> =========================
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
>
> IO scheduler: Vanilla CFQ
>
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> 1   35499KiB/s   35499KiB/s   35499KiB/s   19195 usec
> 2   17089KiB/s   13600KiB/s   30690KiB/s   118K usec
> 4   9165KiB/s    5421KiB/s    29411KiB/s   380K usec
> 8   3815KiB/s    3423KiB/s    29312KiB/s   830K usec
> 16  1911KiB/s    1554KiB/s    28921KiB/s   1756K usec
>
> IO scheduler: IO controller CFQ
>
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> 1   34494KiB/s   34494KiB/s   34494KiB/s   14482 usec
> 2   16983KiB/s   13632KiB/s   30616KiB/s   123K usec
> 4   9237KiB/s    5809KiB/s    29631KiB/s   372K usec
> 8   3901KiB/s    3505KiB/s    29162KiB/s   822K usec
> 16  1895KiB/s    1653KiB/s    28945KiB/s   1778K usec
>
> Test2: Sequential Writers
> =========================
> [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
>
> IO scheduler: Vanilla CFQ
>
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> 1   22669KiB/s   22669KiB/s   22669KiB/s   401K usec
> 2   14760KiB/s   7419KiB/s    22179KiB/s   571K usec
> 4   5862KiB/s    5746KiB/s    23174KiB/s   444K usec
> 8   3377KiB/s    2199KiB/s    22427KiB/s   1057K usec
> 16  2229KiB/s    556KiB/s     20601KiB/s   5099K usec
>
> IO scheduler: IO Controller CFQ
>
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> 1   22911KiB/s   22911KiB/s   22911KiB/s   37319 usec
> 2   11752KiB/s   11632KiB/s   23383KiB/s   245K usec
> 4   6663KiB/s    5409KiB/s    23207KiB/s   384K usec
> 8   3161KiB/s    2460KiB/s    22566KiB/s   935K usec
> 16  1888KiB/s    795KiB/s     21349KiB/s   3009K usec
>
> Test3: Random Readers
> =========================
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
>
> IO scheduler: Vanilla CFQ
>
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> 1   484KiB/s     484KiB/s     484KiB/s     22596 usec
> 2   229KiB/s     196KiB/s     425KiB/s     51111 usec
> 4   119KiB/s     73KiB/s      405KiB/s     2344 msec
> 8   93KiB/s      23KiB/s      399KiB/s     2246 msec
> 16  38KiB/s      8KiB/s       328KiB/s     3965 msec
>
> IO scheduler: IO Controller CFQ
>
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> 1   483KiB/s     483KiB/s     483KiB/s     29391 usec
> 2   229KiB/s     196KiB/s     426KiB/s     51625 usec
> 4   132KiB/s     88KiB/s      417KiB/s     2313 msec
> 8   79KiB/s      18KiB/s      389KiB/s     2298 msec
> 16  43KiB/s      9KiB/s       327KiB/s     3905 msec
>
> Test4: Random Writers
> =====================
> [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
>
> IO scheduler: Vanilla CFQ
>
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> 1   14641KiB/s   14641KiB/s   14641KiB/s   93045 usec
> 2   7896KiB/s    1348KiB/s    9245KiB/s    82778 usec
> 4   2657KiB/s    265KiB/s     6025KiB/s    216K usec
> 8   951KiB/s     122KiB/s     3386KiB/s    1148K usec
> 16  66KiB/s      22KiB/s      829KiB/s     1308 msec
>
> IO scheduler: IO Controller CFQ
>
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> 1   14454KiB/s   14454KiB/s   14454KiB/s   74623 usec
> 2   4595KiB/s    4104KiB/s    8699KiB/s    135K usec
> 4   3113KiB/s    334KiB/s     5782KiB/s    200K usec
> 8   1146KiB/s    95KiB/s      3832KiB/s    593K usec
> 16  71KiB/s      29KiB/s      814KiB/s     1457 msec
>
> Notes:
> - It does not look like anything has changed significantly.
>
> Previous versions of the patches were posted here.
> ------------------------------------------------
>
> (V1) http://lkml.org/lkml/2009/3/11/486
> (V2) http://lkml.org/lkml/2009/5/5/275
> (V3) http://lkml.org/lkml/2009/5/26/472
> (V4) http://lkml.org/lkml/2009/6/8/580
> (V5) http://lkml.org/lkml/2009/6/19/279
> (V6) http://lkml.org/lkml/2009/7/2/369
> (V7) http://lkml.org/lkml/2009/7/24/253
> (V8) http://lkml.org/lkml/2009/8/16/204
> (V9) http://lkml.org/lkml/2009/8/28/327
>
> Thanks
> Vivek
* Re: IO scheduler based IO controller V10 @ 2009-09-24 21:33 ` Andrew Morton 0 siblings, 0 replies; 349+ messages in thread From: Andrew Morton @ 2009-09-24 21:33 UTC (permalink / raw) Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir, paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo, vgoyal, m-ikeda, riel, lizf, fchecconi, containers, linux-kernel, s-uchida, righi.andrea, torvalds On Thu, 24 Sep 2009 15:25:04 -0400 Vivek Goyal <vgoyal@redhat.com> wrote: > > Hi All, > > Here is the V10 of the IO controller patches generated on top of 2.6.31. > Thanks for the writeup. It really helps and is most worthwhile for a project of this importance, size and complexity. > > What problem are we trying to solve > =================================== > Provide group IO scheduling feature in Linux along the lines of other resource > controllers like cpu. > > IOW, provide facility so that a user can group applications using cgroups and > control the amount of disk time/bandwidth received by a group based on its > weight. > > How to solve the problem > ========================= > > Different people have solved the issue differetnly. So far looks it looks > like we seem to have following two core requirements when it comes to > fairness at group level. > > - Control bandwidth seen by groups. > - Control on latencies when a request gets backlogged in group. > > At least there are now three patchsets available (including this one). > > IO throttling > ------------- > This is a bandwidth controller which keeps track of IO rate of a group and > throttles the process in the group if it exceeds the user specified limit. > > dm-ioband > --------- > This is a proportional bandwidth controller implemented as device mapper > driver and provides fair access in terms of amount of IO done (not in terms > of disk time as CFQ does). 
>
> So one will set up one or more dm-ioband devices on top of physical/logical
> block device, configure the ioband device and pass information like grouping
> etc. Now this device will keep track of bios flowing through it and control
> the flow of bios based on group policies.
>
> IO scheduler based IO controller
> --------------------------------
> Here we have viewed the problem of IO controller as hierarchical group
> scheduling (along the lines of CFS group scheduling) issue. Currently one can
> view linux IO schedulers as flat where there is one root group and all the IO
> belongs to that group.
>
> This patchset basically modifies IO schedulers to also support hierarchical
> group scheduling. CFQ already provides fairness among different processes. I
> have extended it to support group IO scheduling. Also took some of the code out
> of CFQ and put in a common layer so that same group scheduling code can be
> used by noop, deadline and AS to support group scheduling.
>
> Pros/Cons
> =========
> There are pros and cons to each of the approaches. Following are some of the
> thoughts.
>
> Max bandwidth vs proportional bandwidth
> ---------------------------------------
> IO throttling is a max bandwidth controller and not a proportional one.
> Additionally it provides fairness in terms of amount of IO done (and not in
> terms of disk time as CFQ does).
>
> Personally, I think that proportional weight controller is useful to more
> people than just max bandwidth controller. In addition, IO scheduler based
> controller can also be enhanced to do max bandwidth control. So it can
> satisfy wider set of requirements.
>
> Fairness in terms of disk time vs size of IO
> ---------------------------------------------
> A higher level controller will most likely be limited to providing fairness
> in terms of size/number of IO done and will find it hard to provide fairness
> in terms of disk time used (as CFQ provides between various prio levels). This
> is because only IO scheduler knows how much disk time a queue has used and
> information about queues and disk time used is not exported to higher
> layers.
>
> So a seeky application will still run away with lot of disk time and bring
> down the overall throughput of the disk.

But that's only true if the thing is poorly implemented.

A high-level controller will need some view of the busyness of the
underlying device(s). That could be "proportion of idle time", or
"average length of queue" or "average request latency" or some mix of
these or something else altogether.

But these things are simple to calculate, and are simple to feed back
to the higher-level controller and probably don't require any changes
to the IO scheduler at all, which is a great advantage.

And I must say that high-level throttling based upon feedback from
lower layers seems like a much better model to me than hacking away in
the IO scheduler layer. Both from an implementation point of view and
from a "we can get it to work on things other than block devices" point
of view.

> Currently dm-ioband provides fairness in terms of number/size of IO.
>
> Latencies and isolation between groups
> --------------------------------------
> A higher level controller is generally implementing a bandwidth throttling
> solution where if a group exceeds either the max bandwidth or the proportional
> share then throttle that group.
>
> This kind of approach will probably not help in controlling latencies as it
> will depend on underlying IO scheduler. Consider following scenario.
>
> Assume there are two groups. One group is running multiple sequential readers
> and other group has a random reader. Sequential readers will get a nice 100ms
> slice

Do you refer to each reader within group1, or to all readers? It would be
daft if each reader in group1 were to get 100ms.

> each and then a random reader from group2 will get to dispatch the
> request. So latency of this random reader will depend on how many sequential
> readers are running in other group and that is a weak isolation between groups.

And yet that is what you appear to mean. But surely nobody would do
that - the 100ms would be assigned to and distributed amongst all
readers in group1?

> When we control things at IO scheduler level, we assign one time slice to one
> group and then pick next entity to run. So effectively after one time slice
> (max 180ms, if prio 0 sequential reader is running), random reader in other
> group will get to run. Hence we achieve better isolation between groups as
> response time of process in a different group is generally not dependent on
> number of processes running in competing group.

I don't understand why you're comparing this implementation with such
an obviously dumb competing design!

> So a higher level solution is most likely limited to only shaping bandwidth
> without any control on latencies.
>
> Stacking group scheduler on top of CFQ can lead to issues
> ---------------------------------------------------------
> IO throttling and dm-ioband both are second level controllers. That is, these
> controllers are implemented in higher layers than io schedulers. So they
> control the IO at higher layer based on group policies and later IO
> schedulers take care of dispatching these bios to disk.
>
> Implementing a second level controller has the advantage of being able to
> provide bandwidth control even on logical block devices in the IO stack
> which don't have any IO schedulers attached to these. But they can also
> interfere with IO scheduling policy of underlying IO scheduler and change
> the effective behavior. Following are some of the issues which I think
> should be visible in second level controller in one form or other.
>
> Prio within group
> ------------------
> A second level controller can potentially interfere with behavior of
> different prio processes within a group. bios are buffered at higher layer
> in single queue and release of bios is FIFO and not proportionate to the
> ioprio of the process. This can result in a particular prio level not
> getting fair share.

That's an administrator error, isn't it? Should have put the
different-priority processes into different groups.

> Buffering at higher layer can delay read requests for more than slice idle
> period of CFQ (default 8 ms). That means, it is possible that we are waiting
> for a request from the queue but it is buffered at higher layer and then idle
> timer will fire. It means that queue will lose its share and at the same time
> overall throughput will be impacted as we lost those 8 ms.

That sounds like a bug.

> Read Vs Write
> -------------
> Writes can overwhelm readers hence second level controller FIFO release
> will run into issue here. If there is a single queue maintained then reads
> will suffer large latencies. If there are separate queues for reads and writes
> then it will be hard to decide in what ratio to dispatch reads and writes as
> it is IO scheduler's decision to decide when and how much read/write to
> dispatch. This is another place where higher level controller will not be in
> sync with lower level io scheduler and can change the effective policies of
> underlying io scheduler.

The IO schedulers already take care of read-vs-write and already take
care of preventing large writes-starve-reads latencies (or at least,
they're supposed to).

> CFQ IO context Issues
> ---------------------
> Buffering at higher layer means submission of bios later with the help of
> a worker thread.

Why?

If it's a read, we just block the userspace process.

If it's a delayed write, the IO submission already happens in a kernel thread.

If it's a synchronous write, we have to block the userspace caller
anyway.

Async reads might be an issue, dunno.

> This changes the io context information at CFQ layer which
> assigns the request to submitting thread. Change of io context info again
> leads to issues of idle timer expiry and issue of a process not getting fair
> share and reduced throughput.

But we already have that problem with delayed writeback, which is a
huge thing - often it's the majority of IO.

> Throughput with noop, deadline and AS
> ---------------------------------------------
> I think a higher level controller will result in reduced overall throughput
> (as compared to io scheduler based io controller) and more seeks with noop,
> deadline and AS.
>
> The reason being, that it is likely that IO within a group will be related
> and will be relatively close as compared to IO across the groups. For example,
> thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
> control, IO from various groups will go into a single queue at lower level
> controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
> G4....) causing more seeks and reduced throughput. (Agreed that merging will
> help up to some extent but still....).
>
> Instead, in case of lower level controller, IO scheduler maintains one queue
> per group hence there is no interleaving of IO between groups. And if IO is
> related within group, then we should get reduced number/amount of seek and
> higher throughput.
>
> Latency can be a concern but that can be controlled by reducing the time
> slice length of the queue.

Well maybe, maybe not. If a group is throttled, it isn't submitting
new IO. The unthrottled group is doing the IO submitting and that IO
will have decent locality.

> Fairness at logical device level vs at physical device level
> ------------------------------------------------------------
>
> IO scheduler based controller has the limitation that it works only with the
> bottom most devices in the IO stack where IO scheduler is attached.
>
> For example, assume a user has created a logical device lv0 using three
> underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> in two groups doing IO on lv0. Also assume that weights of groups are in the
> ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
>
>                              T1    T2
>                                \   /
>                                 lv0
>                               /  |  \
>                            sda  sdb  sdc
>
>
> Now resource control will take place only on devices sda, sdb and sdc and
> not at lv0 level. So if IO from two tasks is relatively uniformly
> distributed across the disks then T1 and T2 will see the throughput ratio
> in proportion to weight specified. But if IO from T1 and T2 is going to
> different disks and there is no contention then at higher level they both
> will see same BW.
>
> Here a second level controller can produce better fairness numbers at
> logical device but most likely at reduced overall throughput of the system,
> because it will try to control IO even if there is no contention at the
> physical level, possibly leaving disks unused in the system.
>
> Hence, the question is how important it is to control bandwidth at
> higher level logical devices also. The actual contention for resources is
> at the leaf block device so it probably makes sense to do any kind of
> control there and not at the intermediate devices. Secondly probably it
> also means better use of available resources.

hm. What will be the effects of this limitation in real-world use?

> Limited Fairness
> ----------------
> Currently CFQ idles on a sequential reader queue to make sure it gets its
> fair share. A second level controller will find it tricky to anticipate.
> Either it will not have any anticipation logic and in that case it will not
> provide fairness to single readers in a group (as dm-ioband does) or if it
> starts anticipating then we should run into these strange situations where
> second level controller is anticipating on one queue/group and underlying
> IO scheduler might be anticipating on something else.

It depends on the size of the inter-group timeslices. If the amount of
time for which a group is unthrottled is "large" compared to the
typical anticipation times, this issue fades away.

And those timeslices _should_ be large. Because as you mentioned
above, different groups are probably working different parts of the
disk.

> Need of device mapper tools
> ---------------------------
> A device mapper based solution will require creation of an ioband device
> on each physical/logical device one wants to control. So it requires usage
> of device mapper tools even for the people who are not using device mapper.
> At the same time creation of ioband device on each partition in the system to
> control the IO can be cumbersome and overwhelming if system has got lots of
> disks and partitions within.
>
>
> IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> problem of group bandwidth control, and can do hierarchical IO scheduling
> more tightly and efficiently.
>
> But I am all ears to alternative approaches and suggestions on how things
> can be done better and will be glad to implement them.
>
> TODO
> ====
> - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> - More testing to make sure there are no regressions in CFQ.
>
> Testing
> =======
>
> Environment
> ==========
> A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.

That's a bit of a toy.

Do we have testing results for more enterprisey hardware? Big storage
arrays? SSD? Infiniband? iscsi? nfs? (lol, gotcha)

> I am mostly
> running fio jobs which have been limited to 30 seconds run and then monitored
> the throughput and latency.
>
> Test1: Random Reader Vs Random Writers
> ======================================
> Launched a random reader and then increasing number of random writers to see
> the effect on random reader BW and max latencies.
>
> [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
>
> [Vanilla CFQ, No groups]
> <--------------random writers-------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec
> 2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec
> 4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec
> 8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec
> 16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec
> 32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec
>
> Created two cgroups group1 and group2 of weights 500 each. Launched increasing
> number of random writers in group1 and one random reader in group2 using fio.
>
> [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> <--------------random writers(group1)-------------> <-random reader(group2)->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec
> 2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec
> 4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec
> 8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec
> 16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec
> 32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec

That's a good result.

> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
>
> [IO controller CFQ; No groups ]
> <--------------random writers-------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec
> 2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec
> 4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec
> 8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec
> 16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec
> 32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec
>
> Notes:
> - With vanilla CFQ, random writers can overwhelm a random reader. Bring down
>   its throughput and bump up latencies significantly.

Isn't that a CFQ shortcoming which we should address separately? If
so, the comparisons aren't presently valid because we're comparing with
a CFQ which has known, should-be-fixed problems.

> - With IO controller, one can provide isolation to the random reader group and
>   maintain consistent view of bandwidth and latencies.
>
> Test2: Random Reader Vs Sequential Reader
> ========================================
> Launched a random reader and then increasing number of sequential readers to
> see the effect on BW and latencies of random reader.
>
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
>
> [ Vanilla CFQ, No groups ]
> <---------------seq readers---------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec
> 2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec
> 4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec
> 8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec
> 16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec
>
> Created two cgroups group1 and group2 of weights 500 each. Launched increasing
> number of sequential readers in group1 and one random reader in group2 using
> fio.
>
> [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> <---------------group1--------------------------->  <------group2--------->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec
> 2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec
> 4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec
> 8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec
> 16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec
>
> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
>
> [IO controller CFQ; No groups ]
> <---------------seq readers---------------------->  <------random reader-->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec
> 2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec
> 4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec
> 8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec
> 16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec
>
> Notes:
> - The BW and latencies of random reader in group 2 seem to be stable and
>   bounded and do not get impacted much as number of sequential readers
>   increases in group1. Hence providing good isolation.
>
> - Throughput of sequential readers comes down and latencies go up as half
>   of disk bandwidth (in terms of time) has been reserved for random reader
>   group.
>
> Test3: Sequential Reader Vs Sequential Reader
> ============================================
> Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> Launched increasing number of sequential readers in group1 and one sequential
> reader in group2 using fio and monitored how bandwidth is being distributed
> between two groups.
>
> First 5 columns give stats about job in group1 and last two columns give
> stats about job in group2.
>
> <---------------group1--------------------------->  <------group2--------->
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> 1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec
> 2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec
> 4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec
> 8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec
> 16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec
>
> Note: group2 is getting double the bandwidth of group1 even in the face
> of increasing number of readers in group1.
>
> Test4 (Isolation between two KVM virtual machines)
> ==================================================
> Created two KVM virtual machines. Partitioned a disk on host in two partitions
> and gave one partition to each virtual machine. Put the two virtual machines
> in two different cgroups of weight 1000 and 500 respectively. Virtual machines
> created ext3 file system on the partitions exported from host and did buffered
> writes. Host sees writes as synchronous and virtual machine with higher weight
> gets double the disk time of virtual machine of lower weight. Used deadline
> scheduler in this test case.
>
> Some more details about configuration are in documentation patch.
>
> Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> ===================================================================
> Fairness for async writes is tricky and biggest reason is that async writes
> are cached in higher layers (page cache) as well as possibly in file system
> layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> in proportional manner.
>
> For example, consider two dd threads reading /dev/zero as input file and doing
> writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> be forced to write out some pages to disk before more pages can be dirtied. But
> not necessarily dirty pages of same thread are picked. It can very well pick
> the inode of lesser priority dd thread and do some writeout. So effectively
> higher weight dd is doing writeouts of lower weight dd pages and we don't see
> service differentiation.
>
> IOW, the core problem with buffered write fairness is that higher weight thread
> does not throw enough IO traffic at IO controller to keep the queue
> continuously backlogged. In my testing, there are many .2 to .8 second
> intervals where higher weight queue is empty and in that duration lower weight
> queue gets lots of job done giving the impression that there was no service
> differentiation.
>
> In summary, from IO controller point of view async writes support is there.
> Because page cache has not been designed in such a manner that higher
> prio/weight writer can do more write out as compared to lower prio/weight
> writer, getting service differentiation is hard and it is visible in some
> cases and not visible in some cases.

Here's where it all falls to pieces.

For async writeback we just don't care about IO priorities. Because
from the point of view of the userspace task, the write was async! It
occurred at memory bandwidth speed.

It's only when the kernel's dirty memory thresholds start to get
exceeded that we start to care about prioritisation. And at that time,
all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
consumes just as much memory as a low-ioprio dirty page.

So when balance_dirty_pages() hits, what do we want to do?

I suppose that all we can do is to block low-ioprio processes more
aggressively at the VFS layer, to reduce the rate at which they're
dirtying memory so as to give high-ioprio processes more of the disk
bandwidth.

But you've gone and implemented all of this stuff at the io-controller
level and not at the VFS level so you're, umm, screwed.

Importantly screwed! It's a very common workload pattern, and one
which causes tremendous amounts of IO to be generated very quickly,
traditionally causing bad latency effects all over the place. And we
have no answer to this.

> Vanilla CFQ Vs IO Controller CFQ
> ================================
> We have not fundamentally changed CFQ, instead enhanced it to also support
> hierarchical io scheduling. In the process invariably there are small changes
> here and there as new scenarios come up. Running some tests here and comparing
> both the CFQ's to see if there is any major deviation in behavior.
>
> Test1: Sequential Readers
> =========================
> [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
>
> IO scheduler: Vanilla CFQ
>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec
> 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec
> 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec
> 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec
> 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec
>
> IO scheduler: IO controller CFQ
>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec
> 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec
> 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec
> 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec
> 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec
>
> Test2: Sequential Writers
> =========================
> [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
>
> IO scheduler: Vanilla CFQ
>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec
> 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec
> 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec
> 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec
> 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec
>
> IO scheduler: IO Controller CFQ
>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec
> 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec
> 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec
> 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec
> 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec
>
> Test3: Random Readers
> =========================
> [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
>
> IO scheduler: Vanilla CFQ
>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1   484KiB/s    484KiB/s    484KiB/s    22596 usec
> 2   229KiB/s    196KiB/s    425KiB/s    51111 usec
> 4   119KiB/s    73KiB/s     405KiB/s    2344 msec
> 8   93KiB/s     23KiB/s     399KiB/s    2246 msec
> 16  38KiB/s     8KiB/s      328KiB/s    3965 msec
>
> IO scheduler: IO Controller CFQ
>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1   483KiB/s    483KiB/s    483KiB/s    29391 usec
> 2   229KiB/s    196KiB/s    426KiB/s    51625 usec
> 4   132KiB/s    88KiB/s     417KiB/s    2313 msec
> 8   79KiB/s     18KiB/s     389KiB/s    2298 msec
> 16  43KiB/s     9KiB/s      327KiB/s    3905 msec
>
> Test4: Random Writers
> =====================
> [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
>
> IO scheduler: Vanilla CFQ
>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec
> 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec
> 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec
> 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec
> 16  66KiB/s     22KiB/s     829KiB/s    1308 msec
>
> IO scheduler: IO Controller CFQ
>
> nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec
> 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec
> 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec
> 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec
> 16  71KiB/s     29KiB/s     814KiB/s    1457 msec
>
> Notes:
> - Does not look like that anything has changed significantly.
>
> Previous versions of the patches were posted here.
> ------------------------------------------------
>
> (V1) http://lkml.org/lkml/2009/3/11/486
> (V2) http://lkml.org/lkml/2009/5/5/275
> (V3) http://lkml.org/lkml/2009/5/26/472
> (V4) http://lkml.org/lkml/2009/6/8/580
> (V5) http://lkml.org/lkml/2009/6/19/279
> (V6) http://lkml.org/lkml/2009/7/2/369
> (V7) http://lkml.org/lkml/2009/7/24/253
> (V8) http://lkml.org/lkml/2009/8/16/204
> (V9) http://lkml.org/lkml/2009/8/28/327
>
> Thanks
> Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
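
For readers following the numbers in the message above, here is a small back-of-envelope sketch in Python (not part of the patchset or of either mail). It checks two claims from the quoted writeup: that with group weights 500 and 1000 in Test3, group2 gets roughly double the bandwidth of group1, and that the random reader's wait grows with the number of sequential readers in a flat scheduler (100ms slice each) but is bounded by one group slice (max 180ms) with group scheduling. The function names are hypothetical, and the wait model is a deliberately crude assumption; the controller divides disk time rather than bandwidth, so the bandwidth ratio is only a proxy that happens to work because both groups run sequential readers.

```python
# Group weights used in Test3 of the quoted writeup.
WEIGHTS = {"group1": 500, "group2": 1000}

# Aggregate bandwidth in KiB/s copied from the Test3 table:
# number of readers in group1 -> (group1 aggregate, group2 aggregate).
TEST3_AGG_KIB = {
    1: (8970, 20681),
    2: (9984, 19682),
    4: (9280, 19235),
    8: (9926, 19501),
    16: (9395, 19367),
}


def expected_shares(weights):
    """Fraction of service each group should get under proportional weights."""
    total = sum(weights.values())
    return {group: weight / total for group, weight in weights.items()}


def bandwidth_ratios(table):
    """group2/group1 aggregate bandwidth ratio for every row of the table."""
    return {nr: g2 / g1 for nr, (g1, g2) in table.items()}


def flat_wait_ms(n_seq_readers, slice_ms=100):
    """Crude wait estimate when every sequential reader gets its own slice."""
    return n_seq_readers * slice_ms


def grouped_wait_ms(max_group_slice_ms=180):
    """Crude wait bound when the competing group as a whole gets one slice."""
    return max_group_slice_ms


if __name__ == "__main__":
    shares = expected_shares(WEIGHTS)
    print("expected group2/group1 ratio:", shares["group2"] / shares["group1"])
    for nr, ratio in bandwidth_ratios(TEST3_AGG_KIB).items():
        print(f"{nr:2d} readers in group1: measured ratio {ratio:.2f}")
    for n in (1, 4, 16):
        print(f"{n:2d} seq readers: flat ~{flat_wait_ms(n)} ms, "
              f"grouped <= {grouped_wait_ms()} ms")
```

The measured ratios all sit close to the 2:1 target regardless of how many readers run in group1, which is the isolation property the Test3 note describes.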
* Re: IO scheduler based IO controller V10
  2009-09-24 21:33 ` Andrew Morton
  (?)
@ 2009-09-25  1:09 ` KAMEZAWA Hiroyuki
  [not found] ` <20090925100952.55c2dd7a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  ` (2 more replies)
  -1 siblings, 3 replies; 349+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25 1:09 UTC (permalink / raw)
To: Andrew Morton
Cc: Vivek Goyal, linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo, riel

On Thu, 24 Sep 2009 14:33:15 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky and biggest reason is that async writes
> > are cached in higher layers (page cache) as well as possibly in file system
> > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > in proportional manner.
> >
> > For example, consider two dd threads reading /dev/zero as input file and doing
> > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > be forced to write out some pages to disk before more pages can be dirtied. But
> > not necessarily dirty pages of same thread are picked. It can very well pick
> > the inode of lesser priority dd thread and do some writeout. So effectively
> > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > service differentiation.
> >
> > IOW, the core problem with buffered write fairness is that higher weight thread
> > does not throw enough IO traffic at IO controller to keep the queue
> > continuously backlogged. In my testing, there are many .2 to .8 second
> > intervals where higher weight queue is empty and in that duration lower weight
> > queue gets lots of job done giving the impression that there was no service
> > differentiation.
> >
> > In summary, from IO controller point of view async writes support is there.
> > Because page cache has not been designed in such a manner that higher
> > prio/weight writer can do more write out as compared to lower prio/weight
> > writer, getting service differentiation is hard and it is visible in some
> > cases and not visible in some cases.
>
> Here's where it all falls to pieces.
>
> For async writeback we just don't care about IO priorities. Because
> from the point of view of the userspace task, the write was async! It
> occurred at memory bandwidth speed.
>
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation. And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
>
> So when balance_dirty_pages() hits, what do we want to do?
>
> I suppose that all we can do is to block low-ioprio processes more
> aggressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
>
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.
>

I think I must support dirty-ratio in the memcg layer, but not yet.
I can't easily imagine how the system will work if both dirty-ratio and
the io-controller cgroup are supported. But if they are used together as
a set of cgroups, called containers (zone?), it will not be bad, I think.

I wonder whether the final bottleneck queue for fairness in a usual
workload on a usual (small) server will be ext3's journal ;)

Thanks,
-Kame

> Importantly screwed!
It's a very common workload pattern, and one > which causes tremendous amounts of IO to be generated very quickly, > traditionally causing bad latency effects all over the place. And we > have no answer to this. > > > Vanilla CFQ Vs IO Controller CFQ > > ================================ > > We have not fundamentally changed CFQ, instead enhanced it to also support > > hierarchical io scheduling. In the process invariably there are small changes > > here and there as new scenarios come up. Running some tests here and comparing > > both the CFQ's to see if there is any major deviation in behavior. > > > > Test1: Sequential Readers > > ========================= > > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ] > > > > IO scheduler: Vanilla CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 35499KiB/s 35499KiB/s 35499KiB/s 19195 usec > > 2 17089KiB/s 13600KiB/s 30690KiB/s 118K usec > > 4 9165KiB/s 5421KiB/s 29411KiB/s 380K usec > > 8 3815KiB/s 3423KiB/s 29312KiB/s 830K usec > > 16 1911KiB/s 1554KiB/s 28921KiB/s 1756K usec > > > > IO scheduler: IO controller CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 34494KiB/s 34494KiB/s 34494KiB/s 14482 usec > > 2 16983KiB/s 13632KiB/s 30616KiB/s 123K usec > > 4 9237KiB/s 5809KiB/s 29631KiB/s 372K usec > > 8 3901KiB/s 3505KiB/s 29162KiB/s 822K usec > > 16 1895KiB/s 1653KiB/s 28945KiB/s 1778K usec > > > > Test2: Sequential Writers > > ========================= > > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ] > > > > IO scheduler: Vanilla CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 22669KiB/s 22669KiB/s 22669KiB/s 401K usec > > 2 14760KiB/s 7419KiB/s 22179KiB/s 571K usec > > 4 5862KiB/s 5746KiB/s 23174KiB/s 444K usec > > 8 3377KiB/s 2199KiB/s 22427KiB/s 1057K usec > > 16 2229KiB/s 556KiB/s 20601KiB/s 5099K usec > > > > IO scheduler: IO Controller CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth 
Max-latency > > 1 22911KiB/s 22911KiB/s 22911KiB/s 37319 usec > > 2 11752KiB/s 11632KiB/s 23383KiB/s 245K usec > > 4 6663KiB/s 5409KiB/s 23207KiB/s 384K usec > > 8 3161KiB/s 2460KiB/s 22566KiB/s 935K usec > > 16 1888KiB/s 795KiB/s 21349KiB/s 3009K usec > > > > Test3: Random Readers > > ========================= > > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16] > > > > IO scheduler: Vanilla CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 484KiB/s 484KiB/s 484KiB/s 22596 usec > > 2 229KiB/s 196KiB/s 425KiB/s 51111 usec > > 4 119KiB/s 73KiB/s 405KiB/s 2344 msec > > 8 93KiB/s 23KiB/s 399KiB/s 2246 msec > > 16 38KiB/s 8KiB/s 328KiB/s 3965 msec > > > > IO scheduler: IO Controller CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 483KiB/s 483KiB/s 483KiB/s 29391 usec > > 2 229KiB/s 196KiB/s 426KiB/s 51625 usec > > 4 132KiB/s 88KiB/s 417KiB/s 2313 msec > > 8 79KiB/s 18KiB/s 389KiB/s 2298 msec > > 16 43KiB/s 9KiB/s 327KiB/s 3905 msec > > > > Test4: Random Writers > > ===================== > > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16] > > > > IO scheduler: Vanilla CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 14641KiB/s 14641KiB/s 14641KiB/s 93045 usec > > 2 7896KiB/s 1348KiB/s 9245KiB/s 82778 usec > > 4 2657KiB/s 265KiB/s 6025KiB/s 216K usec > > 8 951KiB/s 122KiB/s 3386KiB/s 1148K usec > > 16 66KiB/s 22KiB/s 829KiB/s 1308 msec > > > > IO scheduler: IO Controller CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 14454KiB/s 14454KiB/s 14454KiB/s 74623 usec > > 2 4595KiB/s 4104KiB/s 8699KiB/s 135K usec > > 4 3113KiB/s 334KiB/s 5782KiB/s 200K usec > > 8 1146KiB/s 95KiB/s 3832KiB/s 593K usec > > 16 71KiB/s 29KiB/s 814KiB/s 1457 msec > > > > Notes: > > - Does not look like that anything has changed significantly. > > > > Previous versions of the patches were posted here. 
> > ------------------------------------------------ > > > > (V1) http://lkml.org/lkml/2009/3/11/486 > > (V2) http://lkml.org/lkml/2009/5/5/275 > > (V3) http://lkml.org/lkml/2009/5/26/472 > > (V4) http://lkml.org/lkml/2009/6/8/580 > > (V5) http://lkml.org/lkml/2009/6/19/279 > > (V6) http://lkml.org/lkml/2009/7/2/369 > > (V7) http://lkml.org/lkml/2009/7/24/253 > > (V8) http://lkml.org/lkml/2009/8/16/204 > > (V9) http://lkml.org/lkml/2009/8/28/327 > > > > Thanks > > Vivek > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 349+ messages in thread
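The quoted note that nothing changed significantly can be checked with a little arithmetic. What follows is only a minimal C sketch, assuming nothing beyond the Agg-bdwidth columns of the Test1 (sequential readers) tables quoted above; the helper names `pct_change` and `print_seq_read_deltas` are made up for illustration and are not part of the patch set.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of the patched result versus the vanilla baseline, in percent. */
static double pct_change(double vanilla, double patched)
{
    assert(vanilla > 0.0);  /* a zero baseline has no meaningful ratio */
    return (patched - vanilla) * 100.0 / vanilla;
}

/* Agg-bdwidth (KiB/s) for Test1, sequential readers, copied from the tables. */
static const int nr_jobs[] = { 1, 2, 4, 8, 16 };
static const double agg_vanilla[] = { 35499, 30690, 29411, 29312, 28921 };
static const double agg_ioctrl[]  = { 34494, 30616, 29631, 29162, 28945 };

/* Print one signed percentage per job count (hypothetical helper). */
static void print_seq_read_deltas(void)
{
    for (int i = 0; i < 5; i++)
        printf("nr=%-2d %+.2f%%\n", nr_jobs[i],
               pct_change(agg_vanilla[i], agg_ioctrl[i]));
}
```

Calling `print_seq_read_deltas()` from a `main()` shows every aggregate delta for this test staying within about 3% of vanilla CFQ, with the single-reader row (35499 vs 34494 KiB/s) being the largest.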
[parent not found: <20090925100952.55c2dd7a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>]
* Re: IO scheduler based IO controller V10
       [not found]   ` <20090925100952.55c2dd7a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2009-09-25 1:18     ` KAMEZAWA Hiroyuki
  2009-09-25 4:14     ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25 1:18 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, 25 Sep 2009 10:09:52 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:

> On Thu, 24 Sep 2009 14:33:15 -0700
> Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > ===================================================================
> > > Fairness for async writes is tricky and biggest reason is that async writes
> > > are cached in higher layers (page cache) as well as possibly in file system
> > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > in proportional manner.
> > >
> > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > service differentiation.
> > >
> > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > does not throw enough IO traffic at IO controller to keep the queue
> > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > intervals where higher weight queue is empty and in that duration lower weight
> > > queue get lots of job done giving the impression that there was no service
> > > differentiation.
> > >
> > > In summary, from IO controller point of view async writes support is there.
> > > Because page cache has not been designed in such a manner that higher
> > > prio/weight writer can do more write out as compared to lower prio/weight
> > > writer, getting service differentiation is hard and it is visible in some
> > > cases and not visible in some cases.
> >
> > Here's where it all falls to pieces.
> >
> > For async writeback we just don't care about IO priorities. Because
> > from the point of view of the userspace task, the write was async! It
> > occurred at memory bandwidth speed.
> >
> > It's only when the kernel's dirty memory thresholds start to get
> > exceeded that we start to care about prioritisation. And at that time,
> > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > consumes just as much memory as a low-ioprio dirty page.
> >
> > So when balance_dirty_pages() hits, what do we want to do?
> >
> > I suppose that all we can do is to block low-ioprio processes more
> > aggressively at the VFS layer, to reduce the rate at which they're
> > dirtying memory so as to give high-ioprio processes more of the disk
> > bandwidth.
> >
> > But you've gone and implemented all of this stuff at the io-controller
> > level and not at the VFS level so you're, umm, screwed.
> >
>
> I think I must support dirty-ratio in memcg layer. But not yet.

OR... I'll add a buffered-write cgroup to track buffered writebacks,
and add a control knob such as
  bufferred_write.nr_dirty_thresh
to limit the number of dirty pages generated via a cgroup.

Because memcg just records the owner of a page but does not record who makes
it dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
cgroup code.

But I'm not sure how I should treat I/Os generated by kswapd.

Thanks,
-Kame

^ permalink raw reply [flat|nested] 349+ messages in thread
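The knob proposed in the message above, `bufferred_write.nr_dirty_thresh`, can be pictured with a small user-space model. This is a minimal sketch under stated assumptions, not kernel code and not anything posted in the thread: `struct bw_group` and the three helpers are hypothetical names, and treating a threshold of 0 as "no limit" is a choice made only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cgroup state; not an existing kernel structure. */
struct bw_group {
    unsigned long nr_dirty;         /* dirty pages charged to this group */
    unsigned long nr_dirty_thresh;  /* the proposed knob; 0 means no limit (assumption) */
};

/* Charge one newly dirtied page to the group that dirtied it. */
static void bw_page_dirtied(struct bw_group *g)
{
    assert(g);
    g->nr_dirty++;
}

/* Uncharge a page once writeback has cleaned it. */
static void bw_page_cleaned(struct bw_group *g)
{
    assert(g);
    if (g->nr_dirty > 0)
        g->nr_dirty--;
}

/* True when a writer in this group should be held back before dirtying more. */
static bool bw_over_thresh(const struct bw_group *g)
{
    assert(g);
    return g->nr_dirty_thresh != 0 && g->nr_dirty >= g->nr_dirty_thresh;
}
```

The point of the model is that the counter follows whoever dirtied the page, so a heavy buffered writer hits its own threshold regardless of which group the memory is charged to.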
* Re: IO scheduler based IO controller V10
       [not found]   ` <20090925100952.55c2dd7a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  2009-09-25 1:18     ` KAMEZAWA Hiroyuki
@ 2009-09-25 4:14     ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25 4:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
	jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	paolo.valente-rcYM44yAMweonA0d6jMUrA,
	jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
	jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI,
	riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
	Andrew Morton,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Sep 25, 2009 at 10:09:52AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 24 Sep 2009 14:33:15 -0700
> Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > ===================================================================
> > > Fairness for async writes is tricky and biggest reason is that async writes
> > > are cached in higher layers (page cache) as well as possibly in file system
> > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > in proportional manner.
> > >
> > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > service differentiation.
> > >
> > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > does not throw enough IO traffic at IO controller to keep the queue
> > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > intervals where higher weight queue is empty and in that duration lower weight
> > > queue get lots of job done giving the impression that there was no service
> > > differentiation.
> > >
> > > In summary, from IO controller point of view async writes support is there.
> > > Because page cache has not been designed in such a manner that higher
> > > prio/weight writer can do more write out as compared to lower prio/weight
> > > writer, getting service differentiation is hard and it is visible in some
> > > cases and not visible in some cases.
> >
> > Here's where it all falls to pieces.
> >
> > For async writeback we just don't care about IO priorities. Because
> > from the point of view of the userspace task, the write was async! It
> > occurred at memory bandwidth speed.
> >
> > It's only when the kernel's dirty memory thresholds start to get
> > exceeded that we start to care about prioritisation. And at that time,
> > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > consumes just as much memory as a low-ioprio dirty page.
> >
> > So when balance_dirty_pages() hits, what do we want to do?
> >
> > I suppose that all we can do is to block low-ioprio processes more
> > aggressively at the VFS layer, to reduce the rate at which they're
> > dirtying memory so as to give high-ioprio processes more of the disk
> > bandwidth.
> >
> > But you've gone and implemented all of this stuff at the io-controller
> > level and not at the VFS level so you're, umm, screwed.
> >
>
> I think I must support dirty-ratio in memcg layer. But not yet.
> I can't easily imagine how the system will work if both dirty-ratio and
> io-controller cgroup are supported.

IIUC, you are suggesting a per-memory-cgroup dirty ratio, and a writer will be
throttled if the dirty ratio is crossed. Makes sense to me. Just that the io
controller and memory controller will have to be mounted together.

Thanks
Vivek

> But considering using them as a set of
> cgroups, called containers (zone?), it will not be bad, I think.
>
> The final bottleneck queue for fairness in a usual workload on a usual (small)
> server will be ext3's journal, I wonder ;)
>
> Thanks,
> -Kame
>
>
> > Importantly screwed! It's a very common workload pattern, and one
> > which causes tremendous amounts of IO to be generated very quickly,
> > traditionally causing bad latency effects all over the place. And we
> > have no answer to this.
> >
> > > Vanilla CFQ Vs IO Controller CFQ
> > > ================================
> > > We have not fundamentally changed CFQ, instead enhanced it to also support
> > > hierarchical io scheduling. In the process invariably there are small changes
> > > here and there as new scenarios come up. Running some tests here and comparing
> > > both the CFQ's to see if there is any major deviation in behavior.
> > >
> > > Test1: Sequential Readers
> > > =========================
> > > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > > 1   35499KiB/s   35499KiB/s   35499KiB/s   19195 usec
> > > 2   17089KiB/s   13600KiB/s   30690KiB/s   118K usec
> > > 4   9165KiB/s    5421KiB/s    29411KiB/s   380K usec
> > > 8   3815KiB/s    3423KiB/s    29312KiB/s   830K usec
> > > 16  1911KiB/s    1554KiB/s    28921KiB/s   1756K usec
> > >
> > > IO scheduler: IO controller CFQ
> > >
> > > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > > 1   34494KiB/s   34494KiB/s   34494KiB/s   14482 usec
> > > 2   16983KiB/s   13632KiB/s   30616KiB/s   123K usec
> > > 4   9237KiB/s    5809KiB/s    29631KiB/s   372K usec
> > > 8   3901KiB/s    3505KiB/s    29162KiB/s   822K usec
> > > 16  1895KiB/s    1653KiB/s    28945KiB/s   1778K usec
> > >
> > > Test2: Sequential Writers
> > > =========================
> > > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > > 1   22669KiB/s   22669KiB/s   22669KiB/s   401K usec
> > > 2   14760KiB/s   7419KiB/s    22179KiB/s   571K usec
> > > 4   5862KiB/s    5746KiB/s    23174KiB/s   444K usec
> > > 8   3377KiB/s    2199KiB/s    22427KiB/s   1057K usec
> > > 16  2229KiB/s    556KiB/s     20601KiB/s   5099K usec
> > >
> > > IO scheduler: IO Controller CFQ
> > >
> > > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > > 1   22911KiB/s   22911KiB/s   22911KiB/s   37319 usec
> > > 2   11752KiB/s   11632KiB/s   23383KiB/s   245K usec
> > > 4   6663KiB/s    5409KiB/s    23207KiB/s   384K usec
> > > 8   3161KiB/s    2460KiB/s    22566KiB/s   935K usec
> > > 16  1888KiB/s    795KiB/s     21349KiB/s   3009K usec
> > >
> > > Test3: Random Readers
> > > =========================
> > > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > > 1   484KiB/s     484KiB/s     484KiB/s     22596 usec
> > > 2   229KiB/s     196KiB/s     425KiB/s     51111 usec
> > > 4   119KiB/s     73KiB/s      405KiB/s     2344 msec
> > > 8   93KiB/s      23KiB/s      399KiB/s     2246 msec
> > > 16  38KiB/s      8KiB/s       328KiB/s     3965 msec
> > >
> > > IO scheduler: IO Controller CFQ
> > >
> > > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > > 1   483KiB/s     483KiB/s     483KiB/s     29391 usec
> > > 2   229KiB/s     196KiB/s     426KiB/s     51625 usec
> > > 4   132KiB/s     88KiB/s      417KiB/s     2313 msec
> > > 8   79KiB/s      18KiB/s      389KiB/s     2298 msec
> > > 16  43KiB/s      9KiB/s       327KiB/s     3905 msec
> > >
> > > Test4: Random Writers
> > > =====================
> > > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > > 1   14641KiB/s   14641KiB/s   14641KiB/s   93045 usec
> > > 2   7896KiB/s    1348KiB/s    9245KiB/s    82778 usec
> > > 4   2657KiB/s    265KiB/s     6025KiB/s    216K usec
> > > 8   951KiB/s     122KiB/s     3386KiB/s    1148K usec
> > > 16  66KiB/s      22KiB/s      829KiB/s     1308 msec
> > >
> > > IO scheduler: IO Controller CFQ
> > >
> > > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > > 1   14454KiB/s   14454KiB/s   14454KiB/s   74623 usec
> > > 2   4595KiB/s    4104KiB/s    8699KiB/s    135K usec
> > > 4   3113KiB/s    334KiB/s     5782KiB/s    200K usec
> > > 8   1146KiB/s    95KiB/s      3832KiB/s    593K usec
> > > 16  71KiB/s      29KiB/s      814KiB/s     1457 msec
> > >
> > > Notes:
> > > - Does not look like that anything has changed significantly.
> > >
> > > Previous versions of the patches were posted here.
> > > ------------------------------------------------
> > >
> > > (V1) http://lkml.org/lkml/2009/3/11/486
> > > (V2) http://lkml.org/lkml/2009/5/5/275
> > > (V3) http://lkml.org/lkml/2009/5/26/472
> > > (V4) http://lkml.org/lkml/2009/6/8/580
> > > (V5) http://lkml.org/lkml/2009/6/19/279
> > > (V6) http://lkml.org/lkml/2009/7/2/369
> > > (V7) http://lkml.org/lkml/2009/7/24/253
> > > (V8) http://lkml.org/lkml/2009/8/16/204
> > > (V9) http://lkml.org/lkml/2009/8/28/327
> > >
> > > Thanks
> > > Vivek
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >
^ permalink raw reply [flat|nested] 349+ messages in thread
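The reading given in the reply above, a dirty ratio per memory cgroup with the writer throttled once the ratio is crossed, can be modelled in a few lines. This is only a user-space sketch under assumptions: `struct memcg_model` and both helpers are hypothetical, the ratio is taken as a percentage of the group's memory limit, and a real kernel implementation may compute the limit quite differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one memory cgroup; field names are illustrative only. */
struct memcg_model {
    unsigned long limit_pages;  /* memory limit of the group, in pages */
    unsigned long dirty_pages;  /* dirty pages currently charged to the group */
    unsigned int dirty_ratio;   /* percent of limit_pages allowed to be dirty */
};

/* Number of pages the group may have dirty before its writers are throttled. */
static unsigned long memcg_dirty_limit(const struct memcg_model *m)
{
    assert(m->dirty_ratio <= 100);
    /* widen before multiplying so large limits do not overflow */
    return (unsigned long)((unsigned long long)m->limit_pages * m->dirty_ratio / 100);
}

/* True when a writer in this group has crossed the group's dirty ratio. */
static bool memcg_writer_must_wait(const struct memcg_model *m)
{
    return m->dirty_pages > memcg_dirty_limit(m);
}
```

With a 1000-page limit and a ratio of 20, the group may hold 200 dirty pages; the 201st makes `memcg_writer_must_wait()` return true.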
* Re: IO scheduler based IO controller V10
  2009-09-25 1:18     ` KAMEZAWA Hiroyuki
  (?)
@ 2009-09-25 5:29     ` Balbir Singh
  2009-09-25 7:09       ` Ryo Tsuruta
       [not found]       ` <20090925052911.GK4590-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
  -1 siblings, 2 replies; 349+ messages in thread
From: Balbir Singh @ 2009-09-25 5:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Vivek Goyal, linux-kernel, jens.axboe, containers,
	dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente,
	ryov, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval,
	righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo,
	riel

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-09-25 10:18:21]:

> On Fri, 25 Sep 2009 10:09:52 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Thu, 24 Sep 2009 14:33:15 -0700
> > Andrew Morton <akpm@linux-foundation.org> wrote:
> > > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > > ===================================================================
> > > > Fairness for async writes is tricky and biggest reason is that async writes
> > > > are cached in higher layers (page cache) as well as possibly in file system
> > > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > > in proportional manner.
> > > >
> > > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > > service differentiation.
> > > >
> > > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > > does not throw enough IO traffic at IO controller to keep the queue
> > > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > > intervals where higher weight queue is empty and in that duration lower weight
> > > > queue get lots of job done giving the impression that there was no service
> > > > differentiation.
> > > >
> > > > In summary, from IO controller point of view async writes support is there.
> > > > Because page cache has not been designed in such a manner that higher
> > > > prio/weight writer can do more write out as compared to lower prio/weight
> > > > writer, getting service differentiation is hard and it is visible in some
> > > > cases and not visible in some cases.
> > >
> > > Here's where it all falls to pieces.
> > >
> > > For async writeback we just don't care about IO priorities. Because
> > > from the point of view of the userspace task, the write was async! It
> > > occurred at memory bandwidth speed.
> > >
> > > It's only when the kernel's dirty memory thresholds start to get
> > > exceeded that we start to care about prioritisation. And at that time,
> > > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > > consumes just as much memory as a low-ioprio dirty page.
> > >
> > > So when balance_dirty_pages() hits, what do we want to do?
> > >
> > > I suppose that all we can do is to block low-ioprio processes more
> > > aggressively at the VFS layer, to reduce the rate at which they're
> > > dirtying memory so as to give high-ioprio processes more of the disk
> > > bandwidth.
> > >
> > > But you've gone and implemented all of this stuff at the io-controller
> > > level and not at the VFS level so you're, umm, screwed.
> > >
> >
> > I think I must support dirty-ratio in memcg layer. But not yet.
>

We need to add this to the TODO list.

> OR... I'll add a buffered-write cgroup to track buffered writebacks,
> and add a control knob such as
>   bufferred_write.nr_dirty_thresh
> to limit the number of dirty pages generated via a cgroup.
>
> Because memcg just records the owner of a page but does not record who makes
> it dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
> cgroup code.

Very good point, this is crucial for shared pages.

>
> But I'm not sure how I should treat I/Os generated by kswapd.
>

Account them to process 0 :)

--
	Balbir

^ permalink raw reply [flat|nested] 349+ messages in thread
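Why the owner/dirtier distinction matters for shared pages can be shown with a tiny model. This is a minimal sketch, assuming a per-page record with two fields; `struct page_track`, `NO_GROUP` and the helpers are invented for illustration and do not describe how page_cgroup or the blockio cgroup code actually store this.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_GROUP (-1)

/* Hypothetical per-page tracking record; the field names are illustrative. */
struct page_track {
    int owner;    /* group charged for the memory (what memcg records, per the thread) */
    int dirtier;  /* group that last dirtied the page, NO_GROUP while clean */
};

/* First touch: the faulting group becomes the owner and the page starts clean. */
static void page_charge(struct page_track *p, int group)
{
    p->owner = group;
    p->dirtier = NO_GROUP;
}

/* Any group sharing the page may dirty it; the owner does not change. */
static void page_set_dirty(struct page_track *p, int group)
{
    assert(group != NO_GROUP);
    p->dirtier = group;
}

/* Writeback finished: the page is clean again. */
static void page_clear_dirty(struct page_track *p)
{
    p->dirtier = NO_GROUP;
}

/* True for the shared-page case: dirty, and dirtied by someone other than the owner. */
static bool page_dirtied_by_other(const struct page_track *p)
{
    return p->dirtier != NO_GROUP && p->dirtier != p->owner;
}
```

If group 1 faults a page in and group 2 later writes to it, a limit keyed on `owner` would throttle group 1 for group 2's writes, which is the case the separate `dirtier` field is meant to catch.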
* Re: IO scheduler based IO controller V10
  2009-09-25 5:29       ` Balbir Singh
@ 2009-09-25 7:09         ` Ryo Tsuruta
       [not found]         ` <20090925052911.GK4590-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-09-25 7:09 UTC (permalink / raw)
  To: balbir
  Cc: kamezawa.hiroyu, akpm, vgoyal, linux-kernel, jens.axboe,
	containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi,
	paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer,
	dhaval, righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds,
	mingo, riel

Hi,

Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> > > I think I must support dirty-ratio in memcg layer. But not yet.
> >
>
> We need to add this to the TODO list.
>
> > OR... I'll add a buffered-write cgroup to track buffered writebacks,
> > and add a control knob such as
> >   bufferred_write.nr_dirty_thresh
> > to limit the number of dirty pages generated via a cgroup.
> >
> > Because memcg just records the owner of a page but does not record who makes
> > it dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
> > cgroup code.
>
> Very good point, this is crucial for shared pages.
>
> >
> > But I'm not sure how I should treat I/Os generated by kswapd.
> >
>
> Account them to process 0 :)

How about accounting them to the processes that make pages dirty? I think
that a process which consumes more memory should be penalized. This does
allow a process requesting pages to use another's bandwidth, but if a user
doesn't want the memory to be swapped out, the user should allocate enough
memory for the process by using memcg in advance.

Thanks,
Ryo Tsuruta

^ permalink raw reply [flat|nested] 349+ messages in thread
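The two accounting choices raised for kswapd-generated I/O, charging everything to "process 0" or charging whoever dirtied the pages, differ only in where the charge lands. The following is a minimal sketch assuming a fixed table of groups indexed by small integers; the enum, `io_charged` and both functions are hypothetical names used only to contrast the two policies.

```c
#include <assert.h>

#define ROOT_GROUP 0
#define NR_GROUPS  4

/* The two policies discussed in the thread, under made-up names. */
enum swap_charge_policy {
    CHARGE_ROOT,     /* account reclaim I/O to group 0 */
    CHARGE_DIRTIER   /* account reclaim I/O to the group that dirtied the pages */
};

/* Pages of reclaim-driven writeout charged to each group so far. */
static unsigned long io_charged[NR_GROUPS];

/* Pick the group to charge; unknown or out-of-range dirtiers fall back to root. */
static int swap_charge_target(enum swap_charge_policy policy, int dirtier)
{
    if (policy == CHARGE_DIRTIER && dirtier >= 0 && dirtier < NR_GROUPS)
        return dirtier;
    return ROOT_GROUP;
}

/* Record that `pages` pages were written out on behalf of `dirtier`. */
static void charge_swap_writeout(enum swap_charge_policy policy,
                                 int dirtier, unsigned long pages)
{
    assert(pages > 0);
    io_charged[swap_charge_target(policy, dirtier)] += pages;
}
```

Under `CHARGE_ROOT` a memory-hungry group's writeout is never counted against it, while `CHARGE_DIRTIER` makes that group pay from its own share, which is the penalty argued for in the message above.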
[parent not found: <20090925101821.1de8091a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>]
* Re: IO scheduler based IO controller V10
  [not found] ` <20090925101821.1de8091a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
@ 2009-09-25 5:29 ` Balbir Singh
  0 siblings, 0 replies; 349+ messages in thread
From: Balbir Singh @ 2009-09-25 5:29 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, Andrew Morton, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

* KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> [2009-09-25 10:18:21]:

> On Fri, 25 Sep 2009 10:09:52 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote:
>
> > On Thu, 24 Sep 2009 14:33:15 -0700
> > Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > > ===================================================================
> > > > Fairness for async writes is tricky and biggest reason is that async writes
> > > > are cached in higher layers (page cache) as well as possibly in file system
> > > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > > in proportional manner.
> > > >
> > > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > > service differentiation.
> > > >
> > > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > > does not throw enough IO traffic at IO controller to keep the queue
> > > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > > intervals where higher weight queue is empty and in that duration lower weight
> > > > queue get lots of job done giving the impression that there was no service
> > > > differentiation.
> > > >
> > > > In summary, from IO controller point of view async writes support is there.
> > > > Because page cache has not been designed in such a manner that higher
> > > > prio/weight writer can do more write out as compared to lower prio/weight
> > > > writer, getting service differentiation is hard and it is visible in some
> > > > cases and not visible in some cases.
> > >
> > > Here's where it all falls to pieces.
> > >
> > > For async writeback we just don't care about IO priorities. Because
> > > from the point of view of the userspace task, the write was async! It
> > > occurred at memory bandwidth speed.
> > >
> > > It's only when the kernel's dirty memory thresholds start to get
> > > exceeded that we start to care about prioritisation. And at that time,
> > > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > > consumes just as much memory as a low-ioprio dirty page.
> > >
> > > So when balance_dirty_pages() hits, what do we want to do?
> > >
> > > I suppose that all we can do is to block low-ioprio processes more
> > > aggressively at the VFS layer, to reduce the rate at which they're
> > > dirtying memory so as to give high-ioprio processes more of the disk
> > > bandwidth.
> > >
> > > But you've gone and implemented all of this stuff at the io-controller
> > > level and not at the VFS level so you're, umm, screwed.
> > >
> >
> > I think I must support dirty-ratio in memcg layer. But not yet.
>

We need to add this to the TODO list.

> OR...I'll add a bufferred-write-cgroup to track buffered writebacks.
> And add a control knob as
>   bufferred_write.nr_dirty_thresh
> to limit the number of dirty pages generated via a cgroup.
>
> Because memcg just records a owner of pages but not records who makes them
> dirty, this may be better. Maybe I can reuse page_cgroup and Ryo's blockio
> cgroup code.

Very good point, this is crucial for shared pages.

>
> But I'm not sure how I should treat I/Os generated out by kswapd.
>

Account them to process 0 :)

--
	Balbir

^ permalink raw reply [flat|nested] 349+ messages in thread
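The knob KAMEZAWA proposes, bufferred_write.nr_dirty_thresh, amounts to a per-cgroup cap on outstanding dirty pages. A minimal userspace sketch of that bookkeeping follows. Only the name nr_dirty_thresh comes from the message; BufferedWriteGroup, try_dirty and writeback_done are hypothetical names chosen for illustration, and nothing here reflects an actual kernel interface.

```python
class BufferedWriteGroup:
    """Toy model of a cgroup with a cap on the dirty pages it may hold.

    nr_dirty_thresh mirrors the knob name proposed in the thread; the rest
    of this class is a hypothetical illustration, not kernel code.
    """

    def __init__(self, nr_dirty_thresh):
        self.nr_dirty_thresh = nr_dirty_thresh
        self.nr_dirty = 0

    def try_dirty(self, pages=1):
        # Refuse (the writer would have to wait for writeback) once the
        # group would exceed its dirty-page threshold.
        if self.nr_dirty + pages > self.nr_dirty_thresh:
            return False
        self.nr_dirty += pages
        return True

    def writeback_done(self, pages=1):
        # Completed writeback frees room under the threshold again.
        self.nr_dirty = max(0, self.nr_dirty - pages)


group = BufferedWriteGroup(nr_dirty_thresh=3)
accepted = [group.try_dirty() for _ in range(5)]
print(accepted, group.nr_dirty)
```

The point of tracking the dirtier separately from the page owner, as the message notes, is that memcg records who owns a page but not who dirtied it, which matters for shared pages.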
* Re: IO scheduler based IO controller V10
  2009-09-25 1:09 ` KAMEZAWA Hiroyuki
@ 2009-09-25 4:14 ` Vivek Goyal
  2009-09-25 1:18 ` KAMEZAWA Hiroyuki
  2009-09-25 4:14 ` Vivek Goyal
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25 4:14 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo, riel

On Fri, Sep 25, 2009 at 10:09:52AM +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 24 Sep 2009 14:33:15 -0700
> Andrew Morton <akpm@linux-foundation.org> wrote:
> > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > > ===================================================================
> > > Fairness for async writes is tricky and biggest reason is that async writes
> > > are cached in higher layers (page cache) as well as possibly in file system
> > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > > in proportional manner.
> > >
> > > For example, consider two dd threads reading /dev/zero as input file and doing
> > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > > be forced to write out some pages to disk before more pages can be dirtied. But
> > > not necessarily dirty pages of same thread are picked. It can very well pick
> > > the inode of lesser priority dd thread and do some writeout. So effectively
> > > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > > service differentiation.
> > >
> > > IOW, the core problem with buffered write fairness is that higher weight thread
> > > does not throw enough IO traffic at IO controller to keep the queue
> > > continuously backlogged. In my testing, there are many .2 to .8 second
> > > intervals where higher weight queue is empty and in that duration lower weight
> > > queue get lots of job done giving the impression that there was no service
> > > differentiation.
> > >
> > > In summary, from IO controller point of view async writes support is there.
> > > Because page cache has not been designed in such a manner that higher
> > > prio/weight writer can do more write out as compared to lower prio/weight
> > > writer, getting service differentiation is hard and it is visible in some
> > > cases and not visible in some cases.
> >
> > Here's where it all falls to pieces.
> >
> > For async writeback we just don't care about IO priorities. Because
> > from the point of view of the userspace task, the write was async! It
> > occurred at memory bandwidth speed.
> >
> > It's only when the kernel's dirty memory thresholds start to get
> > exceeded that we start to care about prioritisation. And at that time,
> > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> > consumes just as much memory as a low-ioprio dirty page.
> >
> > So when balance_dirty_pages() hits, what do we want to do?
> >
> > I suppose that all we can do is to block low-ioprio processes more
> > aggressively at the VFS layer, to reduce the rate at which they're
> > dirtying memory so as to give high-ioprio processes more of the disk
> > bandwidth.
> >
> > But you've gone and implemented all of this stuff at the io-controller
> > level and not at the VFS level so you're, umm, screwed.
> >
>
> I think I must support dirty-ratio in memcg layer. But not yet.
> I can't easily imagine how the system will work if both dirty-ratio and
> io-controller cgroup are supported.

IIUC, you are suggesting a per-memory-cgroup dirty ratio, with the writer
throttled once the dirty ratio is crossed. Makes sense to me. It just means
that the io controller and the memory controller will have to be mounted
together.

Thanks
Vivek

> But considering use them as a set of
> cgroup, called containers(zone?), it's will not be bad, I think.
>
> The final bottleneck queue for fairness in usual workload on usual (small)
> server will ext3's journal, I wonder ;)
>
> Thanks,
> -Kame
>
>
> > Importantly screwed! It's a very common workload pattern, and one
> > which causes tremendous amounts of IO to be generated very quickly,
> > traditionally causing bad latency effects all over the place. And we
> > have no answer to this.
> >
> > > Vanilla CFQ Vs IO Controller CFQ
> > > ================================
> > > We have not fundamentally changed CFQ, instead enhanced it to also support
> > > hierarchical io scheduling. In the process invariably there are small changes
> > > here and there as new scenarios come up. Running some tests here and comparing
> > > both the CFQ's to see if there is any major deviation in behavior.
> > >
> > > Test1: Sequential Readers
> > > =========================
> > > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec
> > > 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec
> > > 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec
> > > 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec
> > > 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec
> > >
> > > IO scheduler: IO controller CFQ
> > >
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec
> > > 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec
> > > 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec
> > > 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec
> > > 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec
> > >
> > > Test2: Sequential Writers
> > > =========================
> > > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec
> > > 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec
> > > 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec
> > > 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec
> > > 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec
> > >
> > > IO scheduler: IO Controller CFQ
> > >
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec
> > > 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec
> > > 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec
> > > 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec
> > > 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec
> > >
> > > Test3: Random Readers
> > > =========================
> > > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1   484KiB/s    484KiB/s    484KiB/s    22596 usec
> > > 2   229KiB/s    196KiB/s    425KiB/s    51111 usec
> > > 4   119KiB/s    73KiB/s     405KiB/s    2344 msec
> > > 8   93KiB/s     23KiB/s     399KiB/s    2246 msec
> > > 16  38KiB/s     8KiB/s      328KiB/s    3965 msec
> > >
> > > IO scheduler: IO Controller CFQ
> > >
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1   483KiB/s    483KiB/s    483KiB/s    29391 usec
> > > 2   229KiB/s    196KiB/s    426KiB/s    51625 usec
> > > 4   132KiB/s    88KiB/s     417KiB/s    2313 msec
> > > 8   79KiB/s     18KiB/s     389KiB/s    2298 msec
> > > 16  43KiB/s     9KiB/s      327KiB/s    3905 msec
> > >
> > > Test4: Random Writers
> > > =====================
> > > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> > >
> > > IO scheduler: Vanilla CFQ
> > >
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec
> > > 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec
> > > 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec
> > > 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec
> > > 16  66KiB/s     22KiB/s     829KiB/s    1308 msec
> > >
> > > IO scheduler: IO Controller CFQ
> > >
> > > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > > 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec
> > > 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec
> > > 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec
> > > 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec
> > > 16  71KiB/s     29KiB/s     814KiB/s    1457 msec
> > >
> > > Notes:
> > > - Does not look like that anything has changed significantly.
> > >
> > > Previous versions of the patches were posted here.
> > > ------------------------------------------------
> > >
> > > (V1) http://lkml.org/lkml/2009/3/11/486
> > > (V2) http://lkml.org/lkml/2009/5/5/275
> > > (V3) http://lkml.org/lkml/2009/5/26/472
> > > (V4) http://lkml.org/lkml/2009/6/8/580
> > > (V5) http://lkml.org/lkml/2009/6/19/279
> > > (V6) http://lkml.org/lkml/2009/7/2/369
> > > (V7) http://lkml.org/lkml/2009/7/24/253
> > > (V8) http://lkml.org/lkml/2009/8/16/204
> > > (V9) http://lkml.org/lkml/2009/8/28/327
> > >
> > > Thanks
> > > Vivek
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >

^ permalink raw reply [flat|nested] 349+ messages in thread
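The quoted note that nothing "has changed significantly" can be checked against the Test1 (sequential readers) aggregate-bandwidth figures quoted in this message. The short sketch below only redoes that arithmetic; the numbers are copied from the quoted tables, while the variable and function names are made up for illustration.

```python
# Agg-bdwidth in KiB/s for Test1 (sequential readers), copied from the
# Vanilla CFQ and IO controller CFQ tables quoted in this message.
NR_JOBS = [1, 2, 4, 8, 16]
VANILLA_AGG = [35499, 30690, 29411, 29312, 28921]
IOC_AGG = [34494, 30616, 29631, 29162, 28945]


def percent_delta(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline


deltas = [percent_delta(v, c) for v, c in zip(VANILLA_AGG, IOC_AGG)]
for nr, delta in zip(NR_JOBS, deltas):
    print(f"numjobs={nr:2d}  aggregate bandwidth change: {delta:+.2f}%")
```

For these five rows the aggregate bandwidth of the IO controller CFQ stays within 3% of vanilla CFQ in either direction. The same comparison could be repeated for the other three tests; latency figures use mixed units (usec, K usec, msec) and would need normalizing first.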
[parent not found: <20090924143315.781cd0ac.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>]
* Re: IO scheduler based IO controller V10
  [not found] ` <20090924143315.781cd0ac.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
@ 2009-09-25 1:09 ` KAMEZAWA Hiroyuki
  2009-09-25 5:04 ` Vivek Goyal
  1 sibling, 0 replies; 349+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-09-25 1:09 UTC (permalink / raw)
To: Andrew Morton
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, 24 Sep 2009 14:33:15 -0700
Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:
> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky and biggest reason is that async writes
> > are cached in higher layers (page cache) as well as possibly in file system
> > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > in proportional manner.
> >
> > For example, consider two dd threads reading /dev/zero as input file and doing
> > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > be forced to write out some pages to disk before more pages can be dirtied. But
> > not necessarily dirty pages of same thread are picked. It can very well pick
> > the inode of lesser priority dd thread and do some writeout. So effectively
> > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > service differentiation.
> >
> > IOW, the core problem with buffered write fairness is that higher weight thread
> > does not throw enough IO traffic at IO controller to keep the queue
> > continuously backlogged. In my testing, there are many .2 to .8 second
> > intervals where higher weight queue is empty and in that duration lower weight
> > queue get lots of job done giving the impression that there was no service
> > differentiation.
> >
> > In summary, from IO controller point of view async writes support is there.
> > Because page cache has not been designed in such a manner that higher
> > prio/weight writer can do more write out as compared to lower prio/weight
> > writer, getting service differentiation is hard and it is visible in some
> > cases and not visible in some cases.
>
> Here's where it all falls to pieces.
>
> For async writeback we just don't care about IO priorities. Because
> from the point of view of the userspace task, the write was async! It
> occurred at memory bandwidth speed.
>
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation. And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
>
> So when balance_dirty_pages() hits, what do we want to do?
>
> I suppose that all we can do is to block low-ioprio processes more
> aggressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
>
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.
>

I think I must support dirty-ratio in the memcg layer. But not yet.
I can't easily imagine how the system will work if both dirty-ratio and
the io-controller cgroup are supported. But if they are used together as a
set of cgroups, called containers (zone?), it will not be bad, I think.

I wonder whether the final bottleneck queue for fairness, in a usual
workload on a usual (small) server, will be ext3's journal ;)

Thanks,
-Kame

> Importantly screwed! It's a very common workload pattern, and one
> which causes tremendous amounts of IO to be generated very quickly,
> traditionally causing bad latency effects all over the place. And we
> have no answer to this.
>
> > Vanilla CFQ Vs IO Controller CFQ
> > ================================
> > We have not fundamentally changed CFQ, instead enhanced it to also support
> > hierarchical io scheduling. In the process invariably there are small changes
> > here and there as new scenarios come up. Running some tests here and comparing
> > both the CFQ's to see if there is any major deviation in behavior.
> >
> > Test1: Sequential Readers
> > =========================
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec
> > 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec
> > 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec
> > 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec
> > 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec
> >
> > IO scheduler: IO controller CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec
> > 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec
> > 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec
> > 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec
> > 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec
> >
> > Test2: Sequential Writers
> > =========================
> > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec
> > 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec
> > 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec
> > 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec
> > 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec
> > 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec
> > 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec
> > 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec
> > 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec
> >
> > Test3: Random Readers
> > =========================
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   484KiB/s    484KiB/s    484KiB/s    22596 usec
> > 2   229KiB/s    196KiB/s    425KiB/s    51111 usec
> > 4   119KiB/s    73KiB/s     405KiB/s    2344 msec
> > 8   93KiB/s     23KiB/s     399KiB/s    2246 msec
> > 16  38KiB/s     8KiB/s      328KiB/s    3965 msec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   483KiB/s    483KiB/s    483KiB/s    29391 usec
> > 2   229KiB/s    196KiB/s    426KiB/s    51625 usec
> > 4   132KiB/s    88KiB/s     417KiB/s    2313 msec
> > 8   79KiB/s     18KiB/s     389KiB/s    2298 msec
> > 16  43KiB/s     9KiB/s      327KiB/s    3905 msec
> >
> > Test4: Random Writers
> > =====================
> > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec
> > 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec
> > 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec
> > 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec
> > 16  66KiB/s     22KiB/s     829KiB/s    1308 msec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec
> > 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec
> > 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec
> > 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec
> > 16  71KiB/s     29KiB/s     814KiB/s    1457 msec
> >
> > Notes:
> > - Does not look like that anything has changed significantly.
> >
> > Previous versions of the patches were posted here.
> > ------------------------------------------------
> >
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > (V6) http://lkml.org/lkml/2009/7/2/369
> > (V7) http://lkml.org/lkml/2009/7/24/253
> > (V8) http://lkml.org/lkml/2009/8/16/204
> > (V9) http://lkml.org/lkml/2009/8/28/327
> >
> > Thanks
> > Vivek
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-09-25 5:04 UTC
To: Andrew Morton
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:
> On Thu, 24 Sep 2009 15:25:04 -0400
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>
> >
> > Hi All,
> >
> > Here is the V10 of the IO controller patches generated on top of 2.6.31.
> >
>
> Thanks for the writeup.  It really helps and is most worthwhile for a
> project of this importance, size and complexity.
>
> >
> > What problem are we trying to solve
> > ===================================
> > Provide group IO scheduling feature in Linux along the lines of other resource
> > controllers like cpu.
> >
> > IOW, provide facility so that a user can group applications using cgroups and
> > control the amount of disk time/bandwidth received by a group based on its
> > weight.
> >
> > How to solve the problem
> > ========================
> >
> > Different people have solved the issue differently. So far it looks
> > like we have the following two core requirements when it comes to
> > fairness at group level.
> >
> > - Control bandwidth seen by groups.
> > - Control on latencies when a request gets backlogged in group.
> >
> > At least there are now three patchsets available (including this one).
> >
> > IO throttling
> > -------------
> > This is a bandwidth controller which keeps track of IO rate of a group and
> > throttles the process in the group if it exceeds the user specified limit.
> >
> > dm-ioband
> > ---------
> > This is a proportional bandwidth controller implemented as device mapper
> > driver and provides fair access in terms of amount of IO done (not in terms
> > of disk time as CFQ does).
> >
> > So one will setup one or more dm-ioband devices on top of physical/logical
> > block device, configure the ioband device and pass information like grouping
> > etc. Now this device will keep track of bios flowing through it and control
> > the flow of bios based on group policies.
> >
> > IO scheduler based IO controller
> > --------------------------------
> > Here we have viewed the problem of IO controller as hierarchical group
> > scheduling (along the lines of CFS group scheduling) issue. Currently one can
> > view linux IO schedulers as flat where there is one root group and all the IO
> > belongs to that group.
> >
> > This patchset basically modifies IO schedulers to also support hierarchical
> > group scheduling. CFQ already provides fairness among different processes. I
> > have extended it to support group IO scheduling. Also took some of the code out
> > of CFQ and put in a common layer so that same group scheduling code can be
> > used by noop, deadline and AS to support group scheduling.
> >
> > Pros/Cons
> > =========
> > There are pros and cons to each of the approaches. Following are some of the
> > thoughts.
> >
> > Max bandwidth vs proportional bandwidth
> > ---------------------------------------
> > IO throttling is a max bandwidth controller and not a proportional one.
> > Additionally it provides fairness in terms of amount of IO done (and not in
> > terms of disk time as CFQ does).
> >
> > Personally, I think that proportional weight controller is useful to more
> > people than just max bandwidth controller. In addition, IO scheduler based
> > controller can also be enhanced to do max bandwidth control. So it can
> > satisfy wider set of requirements.
> >
> > Fairness in terms of disk time vs size of IO
> > --------------------------------------------
> > A higher level controller will most likely be limited to providing fairness
> > in terms of size/number of IO done and will find it hard to provide fairness
> > in terms of disk time used (as CFQ provides between various prio levels). This
> > is because only IO scheduler knows how much disk time a queue has used and
> > information about queues and disk time used is not exported to higher
> > layers.
> >
> > So a seeky application will still run away with lot of disk time and bring
> > down the overall throughput of the disk.
>
> But that's only true if the thing is poorly implemented.
>
> A high-level controller will need some view of the busyness of the
> underlying device(s).  That could be "proportion of idle time", or
> "average length of queue" or "average request latency" or some mix of
> these or something else altogether.
>
> But these things are simple to calculate, and are simple to feed back
> to the higher-level controller and probably don't require any changes
> to the IO scheduler at all, which is a great advantage.
>
>
> And I must say that high-level throttling based upon feedback from
> lower layers seems like a much better model to me than hacking away in
> the IO scheduler layer.  Both from an implementation point of view and
> from a "we can get it to work on things other than block devices" point
> of view.
>

Hi Andrew,

A few thoughts.

- A higher level throttling approach suffers from the issue of unfair
  throttling. If there are multiple tasks in the group, whom do we
  throttle, and how do we make sure that we throttle in proportion to
  the prio of the tasks?
  Andrea's IO throttling implementation suffered from these issues. I
  had run some tests where RT and BE tasks were getting the same BW
  within a group, or tasks of different prio were getting the same BW.
  Even if we figure out a way to do fair throttling within a group, the
  underlying IO scheduler might not be CFQ at all, in which case we
  should not have done so.

  https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html

- Higher level throttling does not know where IO is actually going in
  the physical layer. So we might unnecessarily be throttling IO which
  is going to the same logical device but, at the end of the day, to
  different physical devices.

  Agreed that some people will want that behavior, especially in the
  case of max bandwidth control where one does not want to give you the
  BW because you did not pay for it.

  So a higher level controller is good for max bw control, but when it
  comes to optimal usage of resources and applying control only if
  needed, it probably is not the best thing.

About the feedback thing, I am not very sure. Are you saying that we
will run timed groups in the higher layer and take feedback from the
underlying IO scheduler about how much time a group consumed, or
something like that, and not do accounting in terms of size of IO?

> > Currently dm-ioband provides fairness in terms of number/size of IO.
> >
> > Latencies and isolation between groups
> > --------------------------------------
> > A higher level controller is generally implementing a bandwidth throttling
> > solution where if a group exceeds either the max bandwidth or the proportional
> > share then throttle that group.
> >
> > This kind of approach will probably not help in controlling latencies as it
> > will depend on underlying IO scheduler. Consider following scenario.
> >
> > Assume there are two groups. One group is running multiple sequential readers
> > and other group has a random reader.
> > Sequential readers will get a nice 100ms
> > slice
>
> Do you refer to each reader within group1, or to all readers?  It would be
> daft if each reader in group1 were to get 100ms.
>

All readers in the group should get 100ms each, both in the IO throttling
and dm-ioband solutions.

Higher level solutions are not keeping track of time slices. Time slices
will be allocated by CFQ, which does not have any idea about grouping. A
higher level controller just keeps track of the size of IO done at group
level and then runs either a leaky bucket or a token bucket algorithm.

IO throttling is a max BW controller, so it will not even care about what
is happening in the other group. It will just be concerned with the rate
of IO in one particular group and, if we exceed the specified limit,
throttle it. So unless the sequential reader group hits its max bw limit,
it will keep sending reads down to CFQ, and CFQ will happily assign 100ms
slices to readers.

dm-ioband will not try to choke the high throughput sequential reader
group for the slow random reader group, because that would just kill the
throughput of rotational media. Every sequential reader would run for a
few ms and then be throttled, and this goes on. The disk would soon be
seek bound.

> > each and then a random reader from group2 will get to dispatch the
> > request. So latency of this random reader will depend on how many sequential
> > readers are running in other group and that is a weak isolation between groups.
>
> And yet that is what you appear to mean.
>
> But surely nobody would do that - the 100ms would be assigned to and
> distributed amongst all readers in group1?

Dividing 100ms among all the sequential readers might not be very good on
rotational media, as each reader runs for a small time and then a seek
happens. This will increase the number of seeks in the system. Think of 32
sequential readers in the group, each getting roughly 3ms to run.

A better way probably is to give each queue 100ms in one run of the group
and then switch group.
Something like the following.

  SR1 RR SR2 RR SR3 RR SR4 RR...

Now each sequential reader gets 100ms and the disk is not seek bound; at
the same time, random reader latency is limited by the number of competing
groups and not by the number of processes in the group. This is what the
IO scheduler based IO controller is effectively doing currently.

>
> > When we control things at IO scheduler level, we assign one time slice to one
> > group and then pick next entity to run. So effectively after one time slice
> > (max 180ms, if prio 0 sequential reader is running), random reader in other
> > group will get to run. Hence we achieve better isolation between groups as
> > response time of process in a different group is generally not dependent on
> > number of processes running in competing group.
>
> I don't understand why you're comparing this implementation with such
> an obviously dumb competing design!
>
> > So a higher level solution is most likely limited to only shaping bandwidth
> > without any control on latencies.
> >
> > Stacking group scheduler on top of CFQ can lead to issues
> > ---------------------------------------------------------
> > IO throttling and dm-ioband both are second level controller. That is these
> > controllers are implemented in higher layers than io schedulers. So they
> > control the IO at higher layer based on group policies and later IO
> > schedulers take care of dispatching these bios to disk.
> >
> > Implementing a second level controller has the advantage of being able to
> > provide bandwidth control even on logical block devices in the IO stack
> > which don't have any IO schedulers attached to these. But they can also
> > interfere with IO scheduling policy of underlying IO scheduler and change
> > the effective behavior. Following are some of the issues which I think
> > should be visible in second level controller in one form or other.
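
The dispatch pattern described earlier in this reply (SR1 RR SR2 RR ...) can be compared with a flat round robin using a small Python model. This is only a sketch under simplifying assumptions: every sequential reader consumes a full 100ms slice, the random reader's own service time is ignored, and the function names are made up for illustration (they are not taken from the patchset or from CFQ).

```python
def flat_order(n_seq):
    """Flat round robin over all queues: SR1 SR2 ... SRn RR."""
    return [f"SR{i}" for i in range(1, n_seq + 1)] + ["RR"]


def group_order(n_seq):
    """Alternate between group1 (one reader per visit) and group2 (RR)."""
    order = []
    for i in range(1, n_seq + 1):
        order += [f"SR{i}", "RR"]
    return order


def worst_wait_ms(order, slice_ms=100):
    """Longest stretch RR waits between turns when `order` repeats forever."""
    longest = run = 0
    for name in order + order:  # two passes cover the wrap-around
        if name == "RR":
            longest = max(longest, run)
            run = 0
        else:
            run += 1
    return longest * slice_ms


for n in (4, 32):
    print(n, worst_wait_ms(flat_order(n)), worst_wait_ms(group_order(n)))
```

In this model the random reader's worst-case wait grows with the number of sequential readers in the flat order, but stays at one slice when the scheduler alternates between groups.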
> >
> > Prio with-in group
> > ------------------
> > A second level controller can potentially interfere with behavior of
> > different prio processes with-in a group. bios are buffered at higher layer
> > in single queue and release of bios is FIFO and not proportionate to the
> > ioprio of the process. This can result in a particular prio level not
> > getting fair share.
>
> That's an administrator error, isn't it?  Should have put the
> different-priority processes into different groups.
>

I am thinking that in practice there probably will be a mix of priorities
in each group. For example, consider a hypothetical scenario where two
students on a university server are given two cgroups of certain weights
so that IO done by these students is limited in case of contention. Now
these students might want to throw a mixed-priority workload into their
respective cgroups. The admin would not have any idea what priority
processes the students are running in their respective cgroups.

> > Buffering at higher layer can delay read requests for more than slice idle
> > period of CFQ (default 8 ms). That means, it is possible that we are waiting
> > for a request from the queue but it is buffered at higher layer and then idle
> > timer will fire. It means that queue will lose its share at the same time
> > overall throughput will be impacted as we lost those 8 ms.
>
> That sounds like a bug.
>

Actually this probably is a limitation of a higher level controller. It
most likely is sitting so high in the IO stack that it has no idea what
the underlying IO scheduler is and what the IO scheduler's policies are.
So it can't keep up with the IO scheduler's policies. Secondly, it might
be a low weight group, and tokens might not be available fast enough to
release the request.

> > Read Vs Write
> > -------------
> > Writes can overwhelm readers hence second level controller FIFO release
> > will run into issue here. If there is a single queue maintained then reads
> > will suffer large latencies.
> > If there are separate queues for reads and writes
> > then it will be hard to decide in what ratio to dispatch reads and writes as
> > it is IO scheduler's decision to decide when and how much read/write to
> > dispatch. This is another place where higher level controller will not be in
> > sync with lower level io scheduler and can change the effective policies of
> > underlying io scheduler.
>
> The IO schedulers already take care of read-vs-write and already take
> care of preventing large writes-starve-reads latencies (or at least,
> they're supposed to).

True. Actually this is a limitation of a higher level controller. A higher
level controller will most likely implement some kind of queuing/buffering
mechanism where it will buffer requests when it decides to throttle the
group. Now once a fair number of read and write requests are buffered, and
the controller is ready to dispatch some requests from the group, which
requests/bios should it dispatch? Reads first, or writes first, or reads
and writes in a certain ratio?

In what ratio reads and writes are dispatched is the property and decision
of the IO scheduler. Now the higher level controller will be taking this
decision and changing the behavior of the underlying io scheduler.

>
> > CFQ IO context Issues
> > ---------------------
> > Buffering at higher layer means submission of bios later with the help of
> > a worker thread.
>
> Why?
>
> If it's a read, we just block the userspace process.
>
> If it's a delayed write, the IO submission already happens in a kernel thread.

Is it ok to block pdflush on a group? Some low weight group might block it
for a long time and hence not allow flushing out other pages. Probably
that's the reason pdflush used to check whether the underlying device is
congested and, if it is congested, not go ahead with submission of the
request. With per bdi flusher threads things will change.

I think btrfs also has some threads which don't want to block, and if the
underlying device is congested, they bail out.
That's the reason I implemented a per group congestion interface: if a
thread does not want to block, it can check whether the group the IO is
going into is congested and whether it will block. So for such threads, a
higher level controller would probably have to implement a per group
congestion interface, so that threads which don't want to block can check
with the controller whether it has sufficient BW to let the IO through
without blocking, or maybe start buffering writes in the group queue.

>
> If it's a synchronous write, we have to block the userspace caller
> anyway.
>
> Async reads might be an issue, dunno.
>

I think async IO is one of the reasons. IIRC, Andrea Righi implemented the
policy of returning an error for async IO if the group did not have
sufficient tokens to dispatch the async IO, and expected the application
to retry later. I am not sure if that is ok.

So yes, if we are not buffering any of the read requests and either
blocking the caller or returning an error (async IO), then CFQ io context
is not an issue.

> > This changes the io context information at CFQ layer which
> > assigns the request to submitting thread. Change of io context info again
> > leads to issues of idle timer expiry and issue of a process not getting fair
> > share and reduced throughput.
>
> But we already have that problem with delayed writeback, which is a
> huge thing - often it's the majority of IO.
>

For delayed writes CFQ will not anticipate, so increased anticipation
timer expiry is not an issue with writes. But it probably will be an issue
with reads, where the higher level controller decides to block the next
read while CFQ is anticipating on that read. I am wondering whether such
issues must also appear with all the higher level device mapper/software
raid devices. How do they handle it? Maybe it is more theoretical and in
practice the impact is not significant.
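
To make the tokens and the per group congestion check discussed above a bit more concrete, here is a minimal Python sketch of a token bucket with a non-blocking "is this group congested?" query. Everything in it is an assumption made for illustration: the class name, the method names and the byte-based accounting are hypothetical and are not the interface of the IO throttling patches, dm-ioband, or this patchset. Time is passed in explicitly so the behaviour is deterministic.

```python
class GroupTokenBucket:
    """Toy max-bandwidth throttle for one group (hypothetical API)."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last = 0.0                   # time of the last refill, in seconds

    def _refill(self, now):
        # Add tokens for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now

    def is_congested(self, now, nbytes):
        """Non-blocking check: would an IO of nbytes have to wait?"""
        self._refill(now)
        return self.tokens < nbytes

    def try_dispatch(self, now, nbytes):
        """Consume tokens and return True, or return False without waiting."""
        if self.is_congested(now, nbytes):
            return False
        self.tokens -= nbytes
        return True


bucket = GroupTokenBucket(rate_bytes_per_s=1024, burst_bytes=4096)
print(bucket.try_dispatch(0.0, 4096))   # bucket starts full
print(bucket.is_congested(0.0, 4096))   # no tokens left at the same instant
```

A thread that must not block (a flusher thread, for instance) would call the check first and skip the group when it reports congestion, while a caller that may block could wait until enough tokens have accumulated.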
> > Throughput with noop, deadline and AS
> > -------------------------------------
> > I think a higher level controller will result in reduced overall throughput
> > (as compared to io scheduler based io controller) and more seeks with noop,
> > deadline and AS.
> >
> > The reason being, that it is likely that IO with-in a group will be related
> > and will be relatively close as compared to IO across the groups. For example,
> > thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
> > control, IO from various groups will go into a single queue at lower level
> > controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
> > G4....) causing more seeks and reduced throughput. (Agreed that merging will
> > help up to some extent but still....).
> >
> > Instead, in case of lower level controller, IO scheduler maintains one queue
> > per group hence there is no interleaving of IO between groups. And if IO is
> > related with-in group, then we should get reduced number/amount of seek and
> > higher throughput.
> >
> > Latency can be a concern but that can be controlled by reducing the time
> > slice length of the queue.
>
> Well maybe, maybe not.  If a group is throttled, it isn't submitting
> new IO.  The unthrottled group is doing the IO submitting and that IO
> will have decent locality.

But throttling will kick in only occasionally. The rest of the time both
groups will be dispatching bios at the same time. So for the most part the
IO scheduler will probably see IO from both groups, and there will be
small intervals where one group is completely throttled and the IO
scheduler is busy dispatching requests from a single group only.

>
> > Fairness at logical device level vs at physical device level
> > ------------------------------------------------------------
> >
> > IO scheduler based controller has the limitation that it works only with the
> > bottom most devices in the IO stack where IO scheduler is attached.
> >
> > For example, assume a user has created a logical device lv0 using three
> > underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> > in two groups doing IO on lv0. Also assume that weights of groups are in the
> > ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
> >
> >       T1    T2
> >         \  /
> >         lv0
> >        / | \
> >     sda sdb sdc
> >
> >
> > Now resource control will take place only on devices sda, sdb and sdc and
> > not at lv0 level. So if IO from two tasks is relatively uniformly
> > distributed across the disks then T1 and T2 will see the throughput ratio
> > in proportion to weight specified. But if IO from T1 and T2 is going to
> > different disks and there is no contention then at higher level they both
> > will see same BW.
> >
> > Here a second level controller can produce better fairness numbers at
> > logical device but most likely at reduced overall throughput of the system,
> > because it will try to control IO even if there is no contention at physical
> > level, possibly leaving disks unused in the system.
> >
> > Hence, question comes that how important it is to control bandwidth at
> > higher level logical devices also. The actual contention for resources is
> > at the leaf block device so it probably makes sense to do any kind of
> > control there and not at the intermediate devices. Secondly probably it
> > also means better use of available resources.
>
> hm.  What will be the effects of this limitation in real-world use?

In some cases the user/application will not see the bandwidth ratio
between two groups in the same proportion as the assigned weights, and the
primary reason will be that the workload did not create enough contention
for the physical resources underneath.

So it all depends on what kind of bandwidth guarantees we are offering. If
we are saying that we provide good fairness numbers at logical devices
irrespective of whether resources are used optimally, then it will be
irritating for the user.
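
The lv0 example quoted above can be worked through with a toy Python model. It assumes, purely for illustration, that each leaf disk splits a fixed bandwidth among the groups currently doing IO on it in proportion to their weights, and that an uncontended group gets the whole disk. The function name, the 30 MB/s figure and the dictionary layout are hypothetical and do not come from the patchset.

```python
def leaf_level_throughput(disk_bw, weights, disks_used):
    """Per-group throughput when every disk shares its bandwidth by weight
    among only those groups that are active on that disk."""
    totals = {group: 0.0 for group in weights}
    for disk, bw in disk_bw.items():
        active = [g for g in weights if disk in disks_used[g]]
        total_weight = sum(weights[g] for g in active)
        for g in active:
            totals[g] += bw * weights[g] / total_weight
    return totals


disk_bw = {"sda": 30.0, "sdb": 30.0, "sdc": 30.0}   # MB/s, made-up numbers
weights = {"T1": 2, "T2": 1}

# IO from both tasks spread over all three disks: contention everywhere.
spread = leaf_level_throughput(
    disk_bw, weights,
    {"T1": {"sda", "sdb", "sdc"}, "T2": {"sda", "sdb", "sdc"}})

# T1 only touches sda, T2 only touches sdb: no contention at any leaf.
disjoint = leaf_level_throughput(
    disk_bw, weights, {"T1": {"sda"}, "T2": {"sdb"}})

print(spread, disjoint)
```

In the first case the model gives T1 twice the bandwidth of T2, matching the 2:1 weights. In the second case both tasks get one full disk each, so the ratio seen at lv0 is 1:1 even though no disk is left idle on account of throttling.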
I think it also might become an issue once we implement max bandwidth
control. We will not be able to define max bandwidth on a logical device,
and an application will get more than the max bandwidth if it is doing IO
to different underlying devices.

I would say that leaf node control is good for optimal resource usage and
for proportional BW control, but not a good fit for max bandwidth control.

>
> > Limited Fairness
> > ----------------
> > Currently CFQ idles on a sequential reader queue to make sure it gets its
> > fair share. A second level controller will find it tricky to anticipate.
> > Either it will not have any anticipation logic and in that case it will not
> > provide fairness to single readers in a group (as dm-ioband does) or if it
> > starts anticipating then we should run into these strange situations where
> > second level controller is anticipating on one queue/group and underlying
> > IO scheduler might be anticipating on something else.
>
> It depends on the size of the inter-group timeslices.  If the amount of
> time for which a group is unthrottled is "large" compared to the
> typical anticipation times, this issue fades away.
>
> And those timeslices _should_ be large.  Because as you mentioned
> above, different groups are probably working different parts of the
> disk.
>
> > Need of device mapper tools
> > ---------------------------
> > A device mapper based solution will require creation of a ioband device
> > on each physical/logical device one wants to control. So it requires usage
> > of device mapper tools even for the people who are not using device mapper.
> > At the same time creation of ioband device on each partition in the system to
> > control the IO can be cumbersome and overwhelming if system has got lots of
> > disks and partitions with-in.
> >
> >
> > IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> > problem of group bandwidth control, and can do hierarchical IO scheduling
> > more tightly and efficiently.
> >
> > But I am all ears to alternative approaches and suggestions on how things
> > can be done better and will be glad to implement it.
> >
> > TODO
> > ====
> > - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > - More testing to make sure there are no regressions in CFQ.
> >
> > Testing
> > =======
> >
> > Environment
> > ===========
> > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
>
> That's a bit of a toy.

Yes it is. :-)

>
> Do we have testing results for more enterprisey hardware?  Big storage
> arrays?  SSD?  Infiniband?  iscsi?  nfs? (lol, gotcha)

Not yet. I will try to get hold of some storage arrays and run some tests.

>
>
> > I am mostly
> > running fio jobs which have been limited to 30 seconds run and then monitored
> > the throughput and latency.
> >
> > Test1: Random Reader Vs Random Writers
> > ======================================
> > Launched a random reader and then increasing number of random writers to see
> > the effect on random reader BW and max latencies.
> >
> > [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> >
> > [Vanilla CFQ, No groups]
> > <-------------------random writers------------------->  <-----random reader---->
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> > 1   5737KiB/s    5737KiB/s    5737KiB/s    164K usec    503KiB/s     159K usec
> > 2   2055KiB/s    1984KiB/s    4039KiB/s    1459K usec   150KiB/s     170K usec
> > 4   1238KiB/s    932KiB/s     4419KiB/s    4332K usec   153KiB/s     225K usec
> > 8   1059KiB/s    929KiB/s     7901KiB/s    1260K usec   118KiB/s     377K usec
> > 16  604KiB/s     483KiB/s     8519KiB/s    3081K usec   47KiB/s      756K usec
> > 32  367KiB/s     222KiB/s     9643KiB/s    5940K usec   22KiB/s      923K usec
> >
> > Created two cgroups group1 and group2 of weights 500 each. Launched increasing
> > number of random writers in group1 and one random reader in group2 using fio.
> >
> > [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> > <---------------random writers(group1)--------------->  <-random reader(group2)->
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> > 1   18115KiB/s   18115KiB/s   18115KiB/s   604K usec    345KiB/s     176K usec
> > 2   3752KiB/s    3676KiB/s    7427KiB/s    4367K usec   402KiB/s     187K usec
> > 4   1951KiB/s    1863KiB/s    7642KiB/s    1989K usec   384KiB/s     181K usec
> > 8   755KiB/s     629KiB/s     5683KiB/s    2133K usec   366KiB/s     319K usec
> > 16  418KiB/s     369KiB/s     6276KiB/s    1323K usec   352KiB/s     287K usec
> > 32  236KiB/s     191KiB/s     6518KiB/s    1910K usec   337KiB/s     273K usec
>
> That's a good result.
>
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> >
> > [IO controller CFQ; No groups ]
> > <-------------------random writers------------------->  <-----random reader---->
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> > 1   5696KiB/s    5696KiB/s    5696KiB/s    259K usec    500KiB/s     194K usec
> > 2   2483KiB/s    2197KiB/s    4680KiB/s    887K usec    150KiB/s     159K usec
> > 4   1471KiB/s    1433KiB/s    5817KiB/s    962K usec    126KiB/s     189K usec
> > 8   691KiB/s     580KiB/s     5159KiB/s    2752K usec   197KiB/s     246K usec
> > 16  781KiB/s     698KiB/s     11892KiB/s   943K usec    61KiB/s      529K usec
> > 32  415KiB/s     324KiB/s     12461KiB/s   4614K usec   17KiB/s      737K usec
> >
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader. They bring
> >   down its throughput and bump up latencies significantly.
>
> Isn't that a CFQ shortcoming which we should address separately?  If
> so, the comparisons aren't presently valid because we're comparing with
> a CFQ which has known, should-be-fixed problems.

I am not sure if it is a CFQ issue. These are synchronous random writes.
They are as important as the random reader. So now CFQ has 33 synchronous
queues to serve. Because it does not know about groups, it has no choice
but to serve them in round robin manner. So it does not sound like a CFQ
issue. It is just that CFQ can give the random reader an advantage if it
knows that the random reader is in a different group, and that's where the
IO controller comes into the picture.

>
> > - With IO controller, one can provide isolation to the random reader group and
> >   maintain consistent view of bandwidth and latencies.
> >
> > Test2: Random Reader Vs Sequential Reader
> > =========================================
> > Launched a random reader and then increasing number of sequential readers to
> > see the effect on BW and latencies of random reader.
> >
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> >
> > [ Vanilla CFQ, No groups ]
> > <---------------------seq readers-------------------->  <-----random reader---->
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> > 1   23318KiB/s   23318KiB/s   23318KiB/s   55940 usec   36KiB/s      247K usec
> > 2   14732KiB/s   11406KiB/s   26126KiB/s   142K usec    20KiB/s      446K usec
> > 4   9417KiB/s    5169KiB/s    27338KiB/s   404K usec    10KiB/s      993K usec
> > 8   3360KiB/s    3041KiB/s    25850KiB/s   954K usec    60KiB/s      956K usec
> > 16  1888KiB/s    1457KiB/s    26763KiB/s   1871K usec   28KiB/s      1868K usec
> >
> > Created two cgroups group1 and group2 of weights 500 each. Launched increasing
> > number of sequential readers in group1 and one random reader in group2 using
> > fio.
> >
> > [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> > <------------------------group1---------------------->  <--------group2-------->
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> > 1   13733KiB/s   13733KiB/s   13733KiB/s   247K usec    330KiB/s     154K usec
> > 2   8553KiB/s    4963KiB/s    13514KiB/s   472K usec    322KiB/s     174K usec
> > 4   5045KiB/s    1367KiB/s    13134KiB/s   947K usec    318KiB/s     178K usec
> > 8   1774KiB/s    1420KiB/s    13035KiB/s   1871K usec   323KiB/s     233K usec
> > 16  959KiB/s     518KiB/s     12691KiB/s   3809K usec   324KiB/s     208K usec
> >
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> >
> > [IO controller CFQ; No groups ]
> > <---------------------seq readers-------------------->  <-----random reader---->
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> > 1   23028KiB/s   23028KiB/s   23028KiB/s   47460 usec   36KiB/s      253K usec
> > 2   14452KiB/s   11176KiB/s   25628KiB/s   145K usec    20KiB/s      447K usec
> > 4   8815KiB/s    5720KiB/s    27121KiB/s   396K usec    10KiB/s      968K usec
> > 8   3335KiB/s    2827KiB/s    24866KiB/s   960K usec    62KiB/s      955K usec
> > 16  1784KiB/s    1311KiB/s    26537KiB/s   1883K usec   26KiB/s      1866K usec
> >
> > Notes:
> > - The BW and latencies of random reader in group 2 seems to be stable and
> >   bounded and does not get impacted much as number of sequential readers
> >   increase in group1. Hence providing good isolation.
> >
> > - Throughput of sequential readers comes down and latencies go up as half
> >   of disk bandwidth (in terms of time) has been reserved for random reader
> >   group.
> >
> > Test3: Sequential Reader Vs Sequential Reader
> > =============================================
> > Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> > Launched increasing number of sequential readers in group1 and one sequential
> > reader in group2 using fio and monitored how bandwidth is being distributed
> > between two groups.
> >
> > First 5 columns give stats about job in group1 and last two columns give
> > stats about job in group2.
> >
> > <------------------------group1---------------------->  <--------group2-------->
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> > 1   8970KiB/s    8970KiB/s    8970KiB/s    230K usec    20681KiB/s   124K usec
> > 2   6783KiB/s    3202KiB/s    9984KiB/s    546K usec    19682KiB/s   139K usec
> > 4   4641KiB/s    1029KiB/s    9280KiB/s    1185K usec   19235KiB/s   172K usec
> > 8   1435KiB/s    1079KiB/s    9926KiB/s    2461K usec   19501KiB/s   153K usec
> > 16  764KiB/s     398KiB/s     9395KiB/s    4986K usec   19367KiB/s   172K usec
> >
> > Note: group2 is getting double the bandwidth of group1 even in the face
> > of increasing number of readers in group1.
> >
> > Test4 (Isolation between two KVM virtual machines)
> > ==================================================
> > Created two KVM virtual machines. Partitioned a disk on host in two partitions
> > and gave one partition to each virtual machine. Put both the virtual machines
> > in two different cgroups of weight 1000 and 500 respectively. Virtual machines
> > created ext3 file system on the partitions exported from host and did buffered
> > writes. Host sees writes as synchronous and virtual machine with higher weight
> > gets double the disk time of virtual machine of lower weight. Used deadline
> > scheduler in this test case.
> >
> > Some more details about configuration are in documentation patch.
> >
> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky and biggest reason is that async writes
> > are cached in higher layers (page cache) as well as possibly in file system
> > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > in proportional manner.
> >
> > For example, consider two dd threads reading /dev/zero as input file and doing
> > writes of huge files.
> > Very soon we will cross vm_dirty_ratio and dd thread will
> > be forced to write out some pages to disk before more pages can be dirtied. But
> > not necessarily dirty pages of same thread are picked. It can very well pick
> > the inode of lesser priority dd thread and do some writeout. So effectively
> > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > service differentiation.
> >
> > IOW, the core problem with buffered write fairness is that higher weight thread
> > does not throw enough IO traffic at IO controller to keep the queue
> > continuously backlogged. In my testing, there are many .2 to .8 second
> > intervals where higher weight queue is empty and in that duration lower weight
> > queue gets lots of job done giving the impression that there was no service
> > differentiation.
> >
> > In summary, from IO controller point of view async writes support is there.
> > Because page cache has not been designed in such a manner that higher
> > prio/weight writer can do more write out as compared to lower prio/weight
> > writer, getting service differentiation is hard and it is visible in some
> > cases and not visible in some cases.
>
> Here's where it all falls to pieces.
>
> For async writeback we just don't care about IO priorities. Because
> from the point of view of the userspace task, the write was async! It
> occurred at memory bandwidth speed.
>
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation. And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
>
> So when balance_dirty_pages() hits, what do we want to do?
>
> I suppose that all we can do is to block low-ioprio processes more
> aggressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
>
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.

True, that's an issue. For async writes we don't create parallel IO paths
from user space to IO scheduler, hence it is hard to provide fairness in all
the cases. I think part of the problem is page cache and some serialization
also comes from kjournald.

How about coming up with another cgroup controller for buffered writes or
clubbing it with memory controller as KAMEZAWA Hiroyuki suggested and
co-mount this with io controller? This should help control buffered writes
per cgroup.

>
> Importantly screwed! It's a very common workload pattern, and one
> which causes tremendous amounts of IO to be generated very quickly,
> traditionally causing bad latency effects all over the place. And we
> have no answer to this.
>
> > Vanilla CFQ Vs IO Controller CFQ
> > ================================
> > We have not fundamentally changed CFQ, instead enhanced it to also support
> > hierarchical io scheduling. In the process invariably there are small changes
> > here and there as new scenarios come up. Running some tests here and comparing
> > both the CFQ's to see if there is any major deviation in behavior.
> >
> > Test1: Sequential Readers
> > =========================
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > 1   35499KiB/s   35499KiB/s   35499KiB/s   19195 usec
> > 2   17089KiB/s   13600KiB/s   30690KiB/s   118K usec
> > 4   9165KiB/s    5421KiB/s    29411KiB/s   380K usec
> > 8   3815KiB/s    3423KiB/s    29312KiB/s   830K usec
> > 16  1911KiB/s    1554KiB/s    28921KiB/s   1756K usec
> >
> > IO scheduler: IO controller CFQ
> >
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > 1   34494KiB/s   34494KiB/s   34494KiB/s   14482 usec
> > 2   16983KiB/s   13632KiB/s   30616KiB/s   123K usec
> > 4   9237KiB/s    5809KiB/s    29631KiB/s   372K usec
> > 8   3901KiB/s    3505KiB/s    29162KiB/s   822K usec
> > 16  1895KiB/s    1653KiB/s    28945KiB/s   1778K usec
> >
> > Test2: Sequential Writers
> > =========================
> > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > 1   22669KiB/s   22669KiB/s   22669KiB/s   401K usec
> > 2   14760KiB/s   7419KiB/s    22179KiB/s   571K usec
> > 4   5862KiB/s    5746KiB/s    23174KiB/s   444K usec
> > 8   3377KiB/s    2199KiB/s    22427KiB/s   1057K usec
> > 16  2229KiB/s    556KiB/s     20601KiB/s   5099K usec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > 1   22911KiB/s   22911KiB/s   22911KiB/s   37319 usec
> > 2   11752KiB/s   11632KiB/s   23383KiB/s   245K usec
> > 4   6663KiB/s    5409KiB/s    23207KiB/s   384K usec
> > 8   3161KiB/s    2460KiB/s    22566KiB/s   935K usec
> > 16  1888KiB/s    795KiB/s     21349KiB/s   3009K usec
> >
> > Test3: Random Readers
> > =========================
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > 1   484KiB/s     484KiB/s     484KiB/s     22596 usec
> > 2   229KiB/s     196KiB/s     425KiB/s     51111 usec
> > 4   119KiB/s     73KiB/s      405KiB/s     2344 msec
> > 8   93KiB/s      23KiB/s      399KiB/s     2246 msec
> > 16  38KiB/s      8KiB/s       328KiB/s     3965 msec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > 1   483KiB/s     483KiB/s     483KiB/s     29391 usec
> > 2   229KiB/s     196KiB/s     426KiB/s     51625 usec
> > 4   132KiB/s     88KiB/s      417KiB/s     2313 msec
> > 8   79KiB/s      18KiB/s      389KiB/s     2298 msec
> > 16  43KiB/s      9KiB/s       327KiB/s     3905 msec
> >
> > Test4: Random Writers
> > =====================
> > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > 1   14641KiB/s   14641KiB/s   14641KiB/s   93045 usec
> > 2   7896KiB/s    1348KiB/s    9245KiB/s    82778 usec
> > 4   2657KiB/s    265KiB/s     6025KiB/s    216K usec
> > 8   951KiB/s     122KiB/s     3386KiB/s    1148K usec
> > 16  66KiB/s      22KiB/s      829KiB/s     1308 msec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency
> > 1   14454KiB/s   14454KiB/s   14454KiB/s   74623 usec
> > 2   4595KiB/s    4104KiB/s    8699KiB/s    135K usec
> > 4   3113KiB/s    334KiB/s     5782KiB/s    200K usec
> > 8   1146KiB/s    95KiB/s      3832KiB/s    593K usec
> > 16  71KiB/s      29KiB/s      814KiB/s     1457 msec
> >
> > Notes:
> > - It does not look like anything has changed significantly.
> >
> > Previous versions of the patches were posted here.
> > ------------------------------------------------
> >
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > (V6) http://lkml.org/lkml/2009/7/2/369
> > (V7) http://lkml.org/lkml/2009/7/24/253
> > (V8) http://lkml.org/lkml/2009/8/16/204
> > (V9) http://lkml.org/lkml/2009/8/28/327
> >
> > Thanks
> > Vivek
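As a rough cross-check of the "nothing has changed significantly" note for the Vanilla CFQ vs IO Controller CFQ comparison, the aggregate bandwidth columns of the Test1 (Sequential Readers) tables quoted above can be compared directly. This is only a sketch: the numbers are copied from those two tables, the helper names are made up for illustration, and single 30 second runs say nothing about run-to-run noise.

```python
# Aggregate bandwidth in KiB/s, copied from the Test1 "Sequential Readers"
# tables quoted above, keyed by the number of fio jobs (nr).
VANILLA_CFQ = {1: 35499, 2: 30690, 4: 29411, 8: 29312, 16: 28921}
IO_CONTROLLER_CFQ = {1: 34494, 2: 30616, 4: 29631, 8: 29162, 16: 28945}


def percent_change(baseline: float, measured: float) -> float:
    """Relative change of `measured` against `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


def deviations(baseline: dict, measured: dict) -> dict:
    """Per-nr percent change; both dicts must share the same keys."""
    return {nr: percent_change(baseline[nr], measured[nr]) for nr in baseline}


if __name__ == "__main__":
    for nr, pct in deviations(VANILLA_CFQ, IO_CONTROLLER_CFQ).items():
        print(f"nr={nr:2d}  {pct:+.2f}%")
```

For these five rows the aggregate bandwidth stays within 3% of vanilla CFQ in either direction. The same helper can be pointed at the other three tests by swapping in their Agg-bdwidth columns.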
* Re: IO scheduler based IO controller V10
  2009-09-24 21:33 ` Andrew Morton
@ 2009-09-25  5:04 ` Vivek Goyal
From: Vivek Goyal @ 2009-09-25 5:04 UTC
To: Andrew Morton
Cc: linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah, lizf,
    mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida, taka,
    guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, peterz,
    jmarchan, torvalds, mingo, riel

On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:
> On Thu, 24 Sep 2009 15:25:04 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
>
> >
> > Hi All,
> >
> > Here is the V10 of the IO controller patches generated on top of 2.6.31.
> >
>
> Thanks for the writeup. It really helps and is most worthwhile for a
> project of this importance, size and complexity.
>
> >
> > What problem are we trying to solve
> > ===================================
> > Provide group IO scheduling feature in Linux along the lines of other resource
> > controllers like cpu.
> >
> > IOW, provide facility so that a user can group applications using cgroups and
> > control the amount of disk time/bandwidth received by a group based on its
> > weight.
> >
> > How to solve the problem
> > =========================
> >
> > Different people have solved the issue differently. So far it looks
> > like we have the following two core requirements when it comes to
> > fairness at group level.
> >
> > - Control bandwidth seen by groups.
> > - Control on latencies when a request gets backlogged in group.
> >
> > At least there are now three patchsets available (including this one).
> >
> > IO throttling
> > -------------
> > This is a bandwidth controller which keeps track of IO rate of a group and
> > throttles the process in the group if it exceeds the user specified limit.
> >
> > dm-ioband
> > ---------
> > This is a proportional bandwidth controller implemented as device mapper
> > driver and provides fair access in terms of amount of IO done (not in terms
> > of disk time as CFQ does).
> >
> > So one will setup one or more dm-ioband devices on top of physical/logical
> > block device, configure the ioband device and pass information like grouping
> > etc. Now this device will keep track of bios flowing through it and control
> > the flow of bios based on group policies.
> >
> > IO scheduler based IO controller
> > --------------------------------
> > Here we have viewed the problem of IO controller as hierarchical group
> > scheduling (along the lines of CFS group scheduling) issue. Currently one can
> > view linux IO schedulers as flat where there is one root group and all the IO
> > belongs to that group.
> >
> > This patchset basically modifies IO schedulers to also support hierarchical
> > group scheduling. CFQ already provides fairness among different processes. I
> > have extended it to support group IO scheduling. Also took some of the code out
> > of CFQ and put in a common layer so that same group scheduling code can be
> > used by noop, deadline and AS to support group scheduling.
> >
> > Pros/Cons
> > =========
> > There are pros and cons to each of the approaches. Following are some of the
> > thoughts.
> >
> > Max bandwidth vs proportional bandwidth
> > ---------------------------------------
> > IO throttling is a max bandwidth controller and not a proportional one.
> > Additionally it provides fairness in terms of amount of IO done (and not in
> > terms of disk time as CFQ does).
> >
> > Personally, I think that proportional weight controller is useful to more
> > people than just max bandwidth controller. In addition, IO scheduler based
> > controller can also be enhanced to do max bandwidth control. So it can
> > satisfy wider set of requirements.
> >
> > Fairness in terms of disk time vs size of IO
> > ---------------------------------------------
> > A higher level controller will most likely be limited to providing fairness
> > in terms of size/number of IO done and will find it hard to provide fairness
> > in terms of disk time used (as CFQ provides between various prio levels). This
> > is because only IO scheduler knows how much disk time a queue has used and
> > information about queues and disk time used is not exported to higher
> > layers.
> >
> > So a seeky application will still run away with lot of disk time and bring
> > down the overall throughput of the disk.
>
> But that's only true if the thing is poorly implemented.
>
> A high-level controller will need some view of the busyness of the
> underlying device(s). That could be "proportion of idle time", or
> "average length of queue" or "average request latency" or some mix of
> these or something else altogether.
>
> But these things are simple to calculate, and are simple to feed back
> to the higher-level controller and probably don't require any changes
> to IO scheduler at all, which is a great advantage.
>
>
> And I must say that high-level throttling based upon feedback from
> lower layers seems like a much better model to me than hacking away in
> the IO scheduler layer. Both from an implementation point of view and
> from a "we can get it to work on things other than block devices" point
> of view.
>

Hi Andrew,

Few thoughts.

- A higher level throttling approach suffers from the issue of unfair
  throttling. So if there are multiple tasks in the group, who do we
  throttle and how do we make sure that we did throttling in proportion to
  the prio of tasks? Andrea's IO throttling implementation suffered from
  these issues. I had run some tests where RT and BE tasks were getting same
  BW with-in group or tasks of different prio were getting same BW.
  Even if we figure a way out to do fair throttling with-in group,
  underlying IO scheduler might not be CFQ at all and we should not have
  done so.

  https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html

- Higher level throttling does not know where actually IO is going in
  physical layer. So we might unnecessarily be throttling IO which is going
  to same logical device but at the end of day to different physical
  devices. Agreed that some people will want that behavior, especially in
  the case of max bandwidth control where one does not want to give you the
  BW because you did not pay for it.

  So higher level controller is good for max bw control, but if it comes to
  optimal usage of resources and doing control only if needed, then it
  probably is not the best thing.

About the feedback thing, I am not very sure. Are you saying that we will
run timed groups in higher layer and take feedback from underlying IO
scheduler about how much time a group consumed or something like that and
not do accounting in terms of size of IO?

> > Currently dm-ioband provides fairness in terms of number/size of IO.
> >
> > Latencies and isolation between groups
> > --------------------------------------
> > A higher level controller is generally implementing a bandwidth throttling
> > solution where if a group exceeds either the max bandwidth or the proportional
> > share then throttle that group.
> >
> > This kind of approach will probably not help in controlling latencies as it
> > will depend on underlying IO scheduler. Consider following scenario.
> >
> > Assume there are two groups. One group is running multiple sequential readers
> > and other group has a random reader. sequential readers will get a nice 100ms
> > slice
>
> Do you refer to each reader within group1, or to all readers? It would be
> daft if each reader in group1 were to get 100ms.
>

All readers in the group should get 100ms each, both in IO throttling and
dm-ioband solution.
Higher level solutions are not keeping track of time slices. Time slices
will be allocated by CFQ which does not have any idea about grouping. Higher
level controller just keeps track of size of IO done at group level and then
runs either a leaky bucket or token bucket algorithm.

IO throttling is a max BW controller, so it will not even care about what is
happening in other group. It will just be concerned with rate of IO in one
particular group and if we exceed specified limit, throttle it. So until and
unless sequential reader group hits its max bw limit, it will keep sending
reads down to CFQ, and CFQ will happily assign 100ms slices to readers.

dm-ioband will not try to choke the high throughput sequential reader group
for the slow random reader group because that would just kill the throughput
of rotational media. Every sequential reader will run for few ms and then be
throttled and this goes on. Disk will soon be seek bound.

> > each and then a random reader from group2 will get to dispatch the
> > request. So latency of this random reader will depend on how many sequential
> > readers are running in other group and that is a weak isolation between groups.
>
> And yet that is what you appear to mean.
>
> But surely nobody would do that - the 100ms would be assigned to and
> distributed amongst all readers in group1?

Dividing 100ms among all the sequential readers might not be very good on
rotational media as each reader runs for a small time and then a seek
happens. This will increase number of seeks in the system. Think of 32
sequential readers in the group and then each getting about 3ms to run.

A better way probably is to give each queue 100ms in one run of group and
then switch group. Something like following.

SR1 RR SR2 RR SR3 RR SR4 RR...

Now each sequential reader gets 100ms and disk is not seek bound. At the
same time random reader latency is limited by number of competing groups and
not by number of processes in the group.
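To put rough numbers on the orderings discussed above, here is a small sketch of the worst-case wait seen by the random reader. It assumes a deliberately simplified model that is not taken from the patches: every queue that runs ahead of the random reader uses its full slice, seek and dispatch overhead is ignored, and all function names are made up for illustration.

```python
# Simplified model (an assumption, not the patchset's code): each queue that
# runs ahead of the random reader (RR) consumes its full time slice.


def wait_flat_ms(nr_seq_readers: int, slice_ms: int = 100) -> int:
    """Worst-case RR wait when the scheduler knows nothing about groups:
    RR queues behind every sequential reader (SR1 SR2 ... SRn RR)."""
    return nr_seq_readers * slice_ms


def wait_grouped_ms(nr_competing_groups: int, slice_ms: int = 100) -> int:
    """Worst-case RR wait when groups alternate (SR1 RR SR2 RR ...):
    only one slice per competing group runs before RR's group."""
    return nr_competing_groups * slice_ms


def per_reader_ms_if_split(nr_seq_readers: int, group_slice_ms: int = 100) -> float:
    """Time each reader gets if one group slice is divided among all readers."""
    return group_slice_ms / nr_seq_readers


if __name__ == "__main__":
    for nr in (1, 8, 32):
        print(nr, wait_flat_ms(nr), wait_grouped_ms(1), per_reader_ms_if_split(nr))
```

With 32 sequential readers and one competing group, this model gives a 3200ms wait in the flat case against 100ms when groups alternate, while splitting a single 100ms slice would leave each reader roughly 3ms, which is where the seek-bound concern comes from.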
This interleaving is what the IO scheduler based IO controller is
effectively doing currently.

>
> > When we control things at IO scheduler level, we assign one time slice to one
> > group and then pick next entity to run. So effectively after one time slice
> > (max 180ms, if prio 0 sequential reader is running), random reader in other
> > group will get to run. Hence we achieve better isolation between groups as
> > response time of process in a different group is generally not dependent on
> > number of processes running in competing group.
>
> I don't understand why you're comparing this implementation with such
> an obviously dumb competing design!
>
> > So a higher level solution is most likely limited to only shaping bandwidth
> > without any control on latencies.
> >
> > Stacking group scheduler on top of CFQ can lead to issues
> > ---------------------------------------------------------
> > IO throttling and dm-ioband both are second level controllers. That is, these
> > controllers are implemented in higher layers than io schedulers. So they
> > control the IO at higher layer based on group policies and later IO
> > schedulers take care of dispatching these bios to disk.
> >
> > Implementing a second level controller has the advantage of being able to
> > provide bandwidth control even on logical block devices in the IO stack
> > which don't have any IO schedulers attached to these. But they can also
> > interfere with IO scheduling policy of underlying IO scheduler and change
> > the effective behavior. Following are some of the issues which I think
> > should be visible in second level controller in one form or other.
> >
> > Prio with-in group
> > ------------------
> > A second level controller can potentially interfere with behavior of
> > different prio processes with-in a group. bios are buffered at higher layer
> > in single queue and release of bios is FIFO and not proportionate to the
> > ioprio of the process.
> > This can result in a particular prio level not
> > getting fair share.
>
> That's an administrator error, isn't it? Should have put the
> different-priority processes into different groups.
>

I am thinking in practice it probably will be a mix of priority in each
group. For example, consider a hypothetical scenario where two students on a
university server are given two cgroups of certain weights so that IO done
by these students is limited in case of contention. Now these students might
want to throw in a mix of priority workload in their respective cgroup.
Admin would not have any idea what priority processes students are running
in their respective cgroups.

> > Buffering at higher layer can delay read requests for more than slice idle
> > period of CFQ (default 8 ms). That means, it is possible that we are waiting
> > for a request from the queue but it is buffered at higher layer and then idle
> > timer will fire. It means that queue will lose its share at the same time
> > overall throughput will be impacted as we lost those 8 ms.
>
> That sounds like a bug.
>

Actually this probably is a limitation of higher level controller. It most
likely is sitting so high in IO stack that it has no idea what underlying IO
scheduler is and what are IO scheduler's policies. So it can't keep up with
IO scheduler's policies. Secondly, it might be a low weight group and tokens
might not be available fast enough to release the request.

> > Read Vs Write
> > -------------
> > Writes can overwhelm readers hence second level controller FIFO release
> > will run into issue here. If there is a single queue maintained then reads
> > will suffer large latencies. If there are separate queues for reads and writes
> > then it will be hard to decide in what ratio to dispatch reads and writes as
> > it is IO scheduler's decision to decide when and how much read/write to
> > dispatch.
> > This is another place where higher level controller will not be in
> > sync with lower level io scheduler and can change the effective policies of
> > underlying io scheduler.
>
> The IO schedulers already take care of read-vs-write and already take
> care of preventing large writes-starve-reads latencies (or at least,
> they're supposed to).

True. Actually this is a limitation of higher level controller. A higher
level controller will most likely implement some kind of queuing/buffering
mechanism where it will buffer requests when it decides to throttle the
group. Now once a fair number of read and write requests are buffered, and
if controller is ready to dispatch some requests from the group, which
requests/bios should it dispatch? Reads first or writes first or reads and
writes in certain ratio?

In what ratio reads and writes are dispatched is the property and decision
of IO scheduler. Now higher level controller will be taking this decision
and change the behavior of underlying io scheduler.

>
> > CFQ IO context Issues
> > ---------------------
> > Buffering at higher layer means submission of bios later with the help of
> > a worker thread.
>
> Why?
>
> If it's a read, we just block the userspace process.
>
> If it's a delayed write, the IO submission already happens in a kernel thread.

Is it ok to block pdflush on group? Some low weight group might block it for
long time and hence not allow flushing out other pages. Probably that's the
reason pdflush used to check if underlying device is congested or not and if
it is congested, we don't go ahead with submission of request. With per bdi
flusher thread things will change.

I think btrfs also has some threads which don't want to block and if
underlying device is congested, it bails out. That's the reason I
implemented per group congestion interface where if a thread does not want
to block, it can check whether the group the IO is going into is congested
and whether it will block.
So for such threads, probably higher level controller shall have to
implement per group congestion interface so that threads which don't want to
block can check with the controller whether it has sufficient BW to let it
through and not block, or maybe start buffering writes in group queue.

>
> If it's a synchronous write, we have to block the userspace caller
> anyway.
>
> Async reads might be an issue, dunno.
>

I think async IO is one of the reasons. IIRC, Andrea Righi implemented the
policy of returning error for async IO if group did not have sufficient
tokens to dispatch the async IO and expected the application to retry later.
I am not sure if that is ok.

So yes, if we are not buffering any of the read requests and either blocking
the caller or returning an error (async IO) then CFQ io context is not an
issue.

> > This changes the io context information at CFQ layer which
> > assigns the request to submitting thread. Change of io context info again
> > leads to issues of idle timer expiry and issue of a process not getting fair
> > share and reduced throughput.
>
> But we already have that problem with delayed writeback, which is a
> huge thing - often it's the majority of IO.
>

For delayed writes CFQ will not anticipate so increased anticipation timer
expiry is not an issue with writes. But it probably will be an issue with
reads where if higher level controller decides to block next read and CFQ is
anticipating on that read. I am wondering that such kind of issues must
appear with all the higher level device mapper/software raid devices also.
How do they handle it? Maybe it is more theoretical and in practice impact
is not significant.

> > Throughput with noop, deadline and AS
> > ---------------------------------------------
> > I think a higher level controller will result in reduced overall throughput
> > (as compared to io scheduler based io controller) and more seeks with noop,
> > deadline and AS.
> >
> > The reason being, that it is likely that IO with-in a group will be related
> > and will be relatively close as compared to IO across the groups. For example,
> > thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
> > control, IO from various groups will go into a single queue at lower level
> > controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
> > G4....) causing more seeks and reduced throughput. (Agreed that merging will
> > help up to some extent but still....).
> >
> > Instead, in case of lower level controller, IO scheduler maintains one queue
> > per group hence there is no interleaving of IO between groups. And if IO is
> > related with-in group, then we should get reduced number/amount of seek and
> > higher throughput.
> >
> > Latency can be a concern but that can be controlled by reducing the time
> > slice length of the queue.
>
> Well maybe, maybe not. If a group is throttled, it isn't submitting
> new IO. The unthrottled group is doing the IO submitting and that IO
> will have decent locality.

But throttling will kick in occasionally. Rest of the time both the groups
will be dispatching bios at the same time. So for most part of it IO
scheduler will probably see IO from both the groups and there will be small
intervals where one group is completely throttled and IO scheduler is busy
dispatching requests only from a single group.

>
> > Fairness at logical device level vs at physical device level
> > ------------------------------------------------------------
> >
> > IO scheduler based controller has the limitation that it works only with the
> > bottom most devices in the IO stack where IO scheduler is attached.
> >
> > For example, assume a user has created a logical device lv0 using three
> > underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
> > in two groups doing IO on lv0.
> > Also assume that weights of groups are in the
> > ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
> >
> >          T1    T2
> >            \   /
> >             lv0
> >           /  |  \
> >         sda sdb  sdc
> >
> >
> > Now resource control will take place only on devices sda, sdb and sdc and
> > not at lv0 level. So if IO from two tasks is relatively uniformly
> > distributed across the disks then T1 and T2 will see the throughput ratio
> > in proportion to weight specified. But if IO from T1 and T2 is going to
> > different disks and there is no contention then at higher level they both
> > will see same BW.
> >
> > Here a second level controller can produce better fairness numbers at
> > logical device but most likely at reduced overall throughput of the system,
> > because it will try to control IO even if there is no contention at physical
> > level, possibly leaving disks unused in the system.
> >
> > Hence, question comes that how important it is to control bandwidth at
> > higher level logical devices also. The actual contention for resources is
> > at the leaf block device so it probably makes sense to do any kind of
> > control there and not at the intermediate devices. Secondly probably it
> > also means better use of available resources.
>
> hm. What will be the effects of this limitation in real-world use?

In some cases user/application will not see the bandwidth ratio between two
groups in same proportion as assigned weights and primary reason for that
will be that this workload did not create enough contention for physical
resources underneath.

So it all depends on what kind of bandwidth guarantees we are offering. If
we are saying that we provide good fairness numbers at logical devices
irrespective of whether resources are used optimally or not, then it will be
irritating for the user.

I think it also might become an issue once we implement max bandwidth
control.
We will not be able to define max bandwidth on a logical device and an
application will get more than max bandwidth if it is doing IO to different
underlying devices.

I would say that leaf node control is good for optimal resource usage and
for proportional BW control, but not a good fit for max bandwidth control.

>
> > Limited Fairness
> > ----------------
> > Currently CFQ idles on a sequential reader queue to make sure it gets its
> > fair share. A second level controller will find it tricky to anticipate.
> > Either it will not have any anticipation logic and in that case it will not
> > provide fairness to single readers in a group (as dm-ioband does) or if it
> > starts anticipating then we should run into these strange situations where
> > second level controller is anticipating on one queue/group and underlying
> > IO scheduler might be anticipating on something else.
>
> It depends on the size of the inter-group timeslices. If the amount of
> time for which a group is unthrottled is "large" compared to the
> typical anticipation times, this issue fades away.
>
> And those timeslices _should_ be large. Because as you mentioned
> above, different groups are probably working different parts of the
> disk.
>
> > Need of device mapper tools
> > ---------------------------
> > A device mapper based solution will require creation of an ioband device
> > on each physical/logical device one wants to control. So it requires usage
> > of device mapper tools even for the people who are not using device mapper.
> > At the same time creation of ioband device on each partition in the system to
> > control the IO can be cumbersome and overwhelming if system has got lots of
> > disks and partitions with-in.
> >
> >
> > IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> > problem of group bandwidth control, and can do hierarchical IO scheduling
> > more tightly and efficiently.
> >
> > But I am all ears to alternative approaches and suggestions on how things
> > can be done better and will be glad to implement them.
> >
> > TODO
> > ====
> > - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > - More testing to make sure there are no regressions in CFQ.
> >
> > Testing
> > =======
> >
> > Environment
> > ==========
> > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
>
> That's a bit of a toy.

Yes it is. :-)

>
> Do we have testing results for more enterprisey hardware? Big storage
> arrays? SSD? Infiniband? iscsi? nfs? (lol, gotcha)

Not yet. I will try to get hold of some storage arrays and run some tests.

>
> > I am mostly
> > running fio jobs which have been limited to 30 seconds run and then monitored
> > the throughput and latency.
> >
> > Test1: Random Reader Vs Random Writers
> > ======================================
> > Launched a random reader and then increasing number of random writers to see
> > the effect on random reader BW and max latencies.
> >
> > [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> >
> > [Vanilla CFQ, No groups]
> > <--------------random writers--------------------> <------random reader-->
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> > 1   5737KiB/s    5737KiB/s    5737KiB/s    164K usec    503KiB/s     159K usec
> > 2   2055KiB/s    1984KiB/s    4039KiB/s    1459K usec   150KiB/s     170K usec
> > 4   1238KiB/s    932KiB/s     4419KiB/s    4332K usec   153KiB/s     225K usec
> > 8   1059KiB/s    929KiB/s     7901KiB/s    1260K usec   118KiB/s     377K usec
> > 16  604KiB/s     483KiB/s     8519KiB/s    3081K usec   47KiB/s      756K usec
> > 32  367KiB/s     222KiB/s     9643KiB/s    5940K usec   22KiB/s      923K usec
> >
> > Created two cgroups group1 and group2 of weights 500 each. Launched increasing
> > number of random writers in group1 and one random reader in group2 using fio.
> > > > [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500] > > <--------------random writers(group1)-------------> <-random reader(group2)-> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > > 1 18115KiB/s 18115KiB/s 18115KiB/s 604K usec 345KiB/s 176K usec > > 2 3752KiB/s 3676KiB/s 7427KiB/s 4367K usec 402KiB/s 187K usec > > 4 1951KiB/s 1863KiB/s 7642KiB/s 1989K usec 384KiB/s 181K usec > > 8 755KiB/s 629KiB/s 5683KiB/s 2133K usec 366KiB/s 319K usec > > 16 418KiB/s 369KiB/s 6276KiB/s 1323K usec 352KiB/s 287K usec > > 32 236KiB/s 191KiB/s 6518KiB/s 1910K usec 337KiB/s 273K usec > > That's a good result. > > > Also ran the same test with IO controller CFQ in flat mode to see if there > > are any major deviations from Vanilla CFQ. Does not look like any. > > > > [IO controller CFQ; No groups ] > > <--------------random writers--------------------> <------random reader--> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > > 1 5696KiB/s 5696KiB/s 5696KiB/s 259K usec 500KiB/s 194K usec > > 2 2483KiB/s 2197KiB/s 4680KiB/s 887K usec 150KiB/s 159K usec > > 4 1471KiB/s 1433KiB/s 5817KiB/s 962K usec 126KiB/s 189K usec > > 8 691KiB/s 580KiB/s 5159KiB/s 2752K usec 197KiB/s 246K usec > > 16 781KiB/s 698KiB/s 11892KiB/s 943K usec 61KiB/s 529K usec > > 32 415KiB/s 324KiB/s 12461KiB/s 4614K usec 17KiB/s 737K usec > > > > Notes: > > - With vanilla CFQ, random writers can overwhelm a random reader. Bring down > > its throughput and bump up latencies significantly. > > Isn't that a CFQ shortcoming which we should address separately? If > so, the comparisons aren't presently valid because we're comparing with > a CFQ which has known, should-be-fixed problems. I am not sure if it is a CFQ issue. These are synchronous random writes. These are equally important as random reader. So now CFQ has 33 synchronous queues to serve. 
Becuase it does not know about groups, it has no choice but to serve them in round robin manner. So it does not sound like a CFQ issue. Just that CFQ can give random reader an advantage if it knows that random reader is in a different group and that's where IO controller comes in to picture. > > > - With IO controller, one can provide isolation to the random reader group and > > maintain consitent view of bandwidth and latencies. > > > > Test2: Random Reader Vs Sequential Reader > > ======================================== > > Launched a random reader and then increasing number of sequential readers to > > see the effect on BW and latencies of random reader. > > > > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ] > > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1] > > > > [ Vanilla CFQ, No groups ] > > <---------------seq readers----------------------> <------random reader--> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > > 1 23318KiB/s 23318KiB/s 23318KiB/s 55940 usec 36KiB/s 247K usec > > 2 14732KiB/s 11406KiB/s 26126KiB/s 142K usec 20KiB/s 446K usec > > 4 9417KiB/s 5169KiB/s 27338KiB/s 404K usec 10KiB/s 993K usec > > 8 3360KiB/s 3041KiB/s 25850KiB/s 954K usec 60KiB/s 956K usec > > 16 1888KiB/s 1457KiB/s 26763KiB/s 1871K usec 28KiB/s 1868K usec > > > > Created two cgroups group1 and group2 of weights 500 each. Launched increasing > > number of sequential readers in group1 and one random reader in group2 using > > fio. 
> > > > [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500] > > <---------------group1---------------------------> <------group2---------> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > > 1 13733KiB/s 13733KiB/s 13733KiB/s 247K usec 330KiB/s 154K usec > > 2 8553KiB/s 4963KiB/s 13514KiB/s 472K usec 322KiB/s 174K usec > > 4 5045KiB/s 1367KiB/s 13134KiB/s 947K usec 318KiB/s 178K usec > > 8 1774KiB/s 1420KiB/s 13035KiB/s 1871K usec 323KiB/s 233K usec > > 16 959KiB/s 518KiB/s 12691KiB/s 3809K usec 324KiB/s 208K usec > > > > Also ran the same test with IO controller CFQ in flat mode to see if there > > are any major deviations from Vanilla CFQ. Does not look like any. > > > > [IO controller CFQ; No groups ] > > <---------------seq readers----------------------> <------random reader--> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > > 1 23028KiB/s 23028KiB/s 23028KiB/s 47460 usec 36KiB/s 253K usec > > 2 14452KiB/s 11176KiB/s 25628KiB/s 145K usec 20KiB/s 447K usec > > 4 8815KiB/s 5720KiB/s 27121KiB/s 396K usec 10KiB/s 968K usec > > 8 3335KiB/s 2827KiB/s 24866KiB/s 960K usec 62KiB/s 955K usec > > 16 1784KiB/s 1311KiB/s 26537KiB/s 1883K usec 26KiB/s 1866K usec > > > > Notes: > > - The BW and latencies of random reader in group 2 seems to be stable and > > bounded and does not get impacted much as number of sequential readers > > increase in group1. Hence provding good isolation. > > > > - Throughput of sequential readers comes down and latencies go up as half > > of disk bandwidth (in terms of time) has been reserved for random reader > > group. > > > > Test3: Sequential Reader Vs Sequential Reader > > ============================================ > > Created two cgroups group1 and group2 of weights 500 and 1000 respectively. 
> > Launched increasing number of sequential readers in group1 and one sequential > > reader in group2 using fio and monitored how bandwidth is being distributed > > between two groups. > > > > First 5 columns give stats about job in group1 and last two columns give > > stats about job in group2. > > > > <---------------group1---------------------------> <------group2---------> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > > 1 8970KiB/s 8970KiB/s 8970KiB/s 230K usec 20681KiB/s 124K usec > > 2 6783KiB/s 3202KiB/s 9984KiB/s 546K usec 19682KiB/s 139K usec > > 4 4641KiB/s 1029KiB/s 9280KiB/s 1185K usec 19235KiB/s 172K usec > > 8 1435KiB/s 1079KiB/s 9926KiB/s 2461K usec 19501KiB/s 153K usec > > 16 764KiB/s 398KiB/s 9395KiB/s 4986K usec 19367KiB/s 172K usec > > > > Note: group2 is getting double the bandwidth of group1 even in the face > > of increasing number of readers in group1. > > > > Test4 (Isolation between two KVM virtual machines) > > ================================================== > > Created two KVM virtual machines. Partitioned a disk on host in two partitions > > and gave one partition to each virtual machine. Put both the virtual machines > > in two different cgroup of weight 1000 and 500 each. Virtual machines created > > ext3 file system on the partitions exported from host and did buffered writes. > > Host seems writes as synchronous and virtual machine with higher weight gets > > double the disk time of virtual machine of lower weight. Used deadline > > scheduler in this test case. > > > > Some more details about configuration are in documentation patch. 
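As a rough cross-check of the Test3 and Test4 numbers, the split one would expect from proportional weights can be worked out directly. The sketch below is illustrative only: `expected_share_permille()` is a hypothetical helper, not a function from the patchset, and it assumes every group stays continuously backlogged.

```c
/*
 * Illustrative only: expected share of disk time for one group, in parts
 * per thousand, given its weight and the sum of the weights of all
 * backlogged groups. Hypothetical helper, not part of the patches.
 */
unsigned int expected_share_permille(unsigned int weight,
                                     unsigned int total_weight)
{
    if (total_weight == 0)
        return 0;
    return (unsigned int)(((unsigned long long)weight * 1000) / total_weight);
}
```

With weights 500 and 1000 this gives roughly 333 and 666 parts per thousand, a 1:2 split. That is in line with the Test3 table, where group1 aggregates about 9-10 MiB/s against about 19-20 MiB/s for group2. The controller shares disk time rather than bytes, so the bandwidth ratio only tracks the weight ratio when the workloads are similar, as they are in Test3.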
> > > > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write) > > =================================================================== > > Fairness for async writes is tricky and biggest reason is that async writes > > are cached in higher layers (page cahe) as well as possibly in file system > > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily > > in proportional manner. > > > > For example, consider two dd threads reading /dev/zero as input file and doing > > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will > > be forced to write out some pages to disk before more pages can be dirtied. But > > not necessarily dirty pages of same thread are picked. It can very well pick > > the inode of lesser priority dd thread and do some writeout. So effectively > > higher weight dd is doing writeouts of lower weight dd pages and we don't see > > service differentation. > > > > IOW, the core problem with buffered write fairness is that higher weight thread > > does not throw enought IO traffic at IO controller to keep the queue > > continuously backlogged. In my testing, there are many .2 to .8 second > > intervals where higher weight queue is empty and in that duration lower weight > > queue get lots of job done giving the impression that there was no service > > differentiation. > > > > In summary, from IO controller point of view async writes support is there. > > Because page cache has not been designed in such a manner that higher > > prio/weight writer can do more write out as compared to lower prio/weight > > writer, gettting service differentiation is hard and it is visible in some > > cases and not visible in some cases. > > Here's where it all falls to pieces. > > For async writeback we just don't care about IO priorities. Because > from the point of view of the userspace task, the write was async! It > occurred at memory bandwidth speed. 
> > It's only when the kernel's dirty memory thresholds start to get > exceeded that we start to care about prioritisation. And at that time, > all dirty memory (within a memcg?) is equal - a high-ioprio dirty page > consumes just as much memory as a low-ioprio dirty page. > > So when balance_dirty_pages() hits, what do we want to do? > > I suppose that all we can do is to block low-ioprio processes more > agressively at the VFS layer, to reduce the rate at which they're > dirtying memory so as to give high-ioprio processes more of the disk > bandwidth. > > But you've gone and implemented all of this stuff at the io-controller > level and not at the VFS level so you're, umm, screwed. True that's an issue. For async writes we don't create parallel IO paths from user space to IO scheduler hence it is hard to provide fairness in all the cases. I think part of the problem is page cache and some serialization also comes from kjournald. How about coming up with another cgroup controller for buffered writes or clubbing it with memory controller as KAMEZAWA Hiroyuki suggested and co-mount this with io controller? This should help control buffered writes per cgroup. > > Importantly screwed! It's a very common workload pattern, and one > which causes tremendous amounts of IO to be generated very quickly, > traditionally causing bad latency effects all over the place. And we > have no answer to this. > > > Vanilla CFQ Vs IO Controller CFQ > > ================================ > > We have not fundamentally changed CFQ, instead enhanced it to also support > > hierarchical io scheduling. In the process invariably there are small changes > > here and there as new scenarios come up. Running some tests here and comparing > > both the CFQ's to see if there is any major deviation in behavior. 
> > > > Test1: Sequential Readers > > ========================= > > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ] > > > > IO scheduler: Vanilla CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 35499KiB/s 35499KiB/s 35499KiB/s 19195 usec > > 2 17089KiB/s 13600KiB/s 30690KiB/s 118K usec > > 4 9165KiB/s 5421KiB/s 29411KiB/s 380K usec > > 8 3815KiB/s 3423KiB/s 29312KiB/s 830K usec > > 16 1911KiB/s 1554KiB/s 28921KiB/s 1756K usec > > > > IO scheduler: IO controller CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 34494KiB/s 34494KiB/s 34494KiB/s 14482 usec > > 2 16983KiB/s 13632KiB/s 30616KiB/s 123K usec > > 4 9237KiB/s 5809KiB/s 29631KiB/s 372K usec > > 8 3901KiB/s 3505KiB/s 29162KiB/s 822K usec > > 16 1895KiB/s 1653KiB/s 28945KiB/s 1778K usec > > > > Test2: Sequential Writers > > ========================= > > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ] > > > > IO scheduler: Vanilla CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 22669KiB/s 22669KiB/s 22669KiB/s 401K usec > > 2 14760KiB/s 7419KiB/s 22179KiB/s 571K usec > > 4 5862KiB/s 5746KiB/s 23174KiB/s 444K usec > > 8 3377KiB/s 2199KiB/s 22427KiB/s 1057K usec > > 16 2229KiB/s 556KiB/s 20601KiB/s 5099K usec > > > > IO scheduler: IO Controller CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 22911KiB/s 22911KiB/s 22911KiB/s 37319 usec > > 2 11752KiB/s 11632KiB/s 23383KiB/s 245K usec > > 4 6663KiB/s 5409KiB/s 23207KiB/s 384K usec > > 8 3161KiB/s 2460KiB/s 22566KiB/s 935K usec > > 16 1888KiB/s 795KiB/s 21349KiB/s 3009K usec > > > > Test3: Random Readers > > ========================= > > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16] > > > > IO scheduler: Vanilla CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 484KiB/s 484KiB/s 484KiB/s 22596 usec > > 2 229KiB/s 196KiB/s 425KiB/s 51111 usec > > 4 119KiB/s 73KiB/s 
405KiB/s 2344 msec > > 8 93KiB/s 23KiB/s 399KiB/s 2246 msec > > 16 38KiB/s 8KiB/s 328KiB/s 3965 msec > > > > IO scheduler: IO Controller CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 483KiB/s 483KiB/s 483KiB/s 29391 usec > > 2 229KiB/s 196KiB/s 426KiB/s 51625 usec > > 4 132KiB/s 88KiB/s 417KiB/s 2313 msec > > 8 79KiB/s 18KiB/s 389KiB/s 2298 msec > > 16 43KiB/s 9KiB/s 327KiB/s 3905 msec > > > > Test4: Random Writers > > ===================== > > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16] > > > > IO scheduler: Vanilla CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 14641KiB/s 14641KiB/s 14641KiB/s 93045 usec > > 2 7896KiB/s 1348KiB/s 9245KiB/s 82778 usec > > 4 2657KiB/s 265KiB/s 6025KiB/s 216K usec > > 8 951KiB/s 122KiB/s 3386KiB/s 1148K usec > > 16 66KiB/s 22KiB/s 829KiB/s 1308 msec > > > > IO scheduler: IO Controller CFQ > > > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > > 1 14454KiB/s 14454KiB/s 14454KiB/s 74623 usec > > 2 4595KiB/s 4104KiB/s 8699KiB/s 135K usec > > 4 3113KiB/s 334KiB/s 5782KiB/s 200K usec > > 8 1146KiB/s 95KiB/s 3832KiB/s 593K usec > > 16 71KiB/s 29KiB/s 814KiB/s 1457 msec > > > > Notes: > > - Does not look like that anything has changed significantly. > > > > Previous versions of the patches were posted here. > > ------------------------------------------------ > > > > (V1) http://lkml.org/lkml/2009/3/11/486 > > (V2) http://lkml.org/lkml/2009/5/5/275 > > (V3) http://lkml.org/lkml/2009/5/26/472 > > (V4) http://lkml.org/lkml/2009/6/8/580 > > (V5) http://lkml.org/lkml/2009/6/19/279 > > (V6) http://lkml.org/lkml/2009/7/2/369 > > (V7) http://lkml.org/lkml/2009/7/24/253 > > (V8) http://lkml.org/lkml/2009/8/16/204 > > (V9) http://lkml.org/lkml/2009/8/28/327 > > > > Thanks > > Vivek ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 @ 2009-09-25 5:04 ` Vivek Goyal
From: Vivek Goyal @ 2009-09-25 5:04 UTC
To: Andrew Morton
Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
    paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman,
    mingo, m-ikeda, riel, lizf, fchecconi, containers, linux-kernel,
    s-uchida, righi.andrea, torvalds

On Thu, Sep 24, 2009 at 02:33:15PM -0700, Andrew Morton wrote:
> On Thu, 24 Sep 2009 15:25:04 -0400
> Vivek Goyal <vgoyal@redhat.com> wrote:
>
> >
> > Hi All,
> >
> > Here is the V10 of the IO controller patches generated on top of 2.6.31.
> >
>
> Thanks for the writeup.  It really helps and is most worthwhile for a
> project of this importance, size and complexity.
>
> >
> > What problem are we trying to solve
> > ===================================
> > Provide group IO scheduling feature in Linux along the lines of other resource
> > controllers like cpu.
> >
> > IOW, provide facility so that a user can group applications using cgroups and
> > control the amount of disk time/bandwidth received by a group based on its
> > weight.
> >
> > How to solve the problem
> > =========================
> >
> > Different people have solved the issue differently. So far it looks
> > like we have the following two core requirements when it comes to
> > fairness at group level.
> >
> > - Control bandwidth seen by groups.
> > - Control on latencies when a request gets backlogged in group.
> >
> > At least there are now three patchsets available (including this one).
> >
> > IO throttling
> > -------------
> > This is a bandwidth controller which keeps track of IO rate of a group and
> > throttles the process in the group if it exceeds the user specified limit.
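To make the "track the IO rate and throttle above the limit" idea concrete, here is a minimal token bucket sketch. It is an assumption about how such a max bandwidth controller could work, not the code of the IO throttling patchset; `struct tb_group`, `tb_refill()` and `tb_may_dispatch()` are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical per-group state for a max-bandwidth throttle. */
struct tb_group {
    uint64_t rate_bps;    /* user specified limit, bytes per second */
    uint64_t burst_bytes; /* bucket capacity */
    uint64_t tokens;      /* bytes the group may still submit */
    uint64_t last_ns;     /* time of the last refill, nanoseconds */
};

/* Credit the group with the bytes it has earned since the last refill. */
static void tb_refill(struct tb_group *g, uint64_t now_ns)
{
    uint64_t earned = (now_ns - g->last_ns) * g->rate_bps / 1000000000ULL;

    if (earned == 0)
        return; /* keep last_ns so that short intervals accumulate */
    g->tokens += earned;
    if (g->tokens > g->burst_bytes)
        g->tokens = g->burst_bytes;
    g->last_ns = now_ns;
}

/* Returns 1 if the IO may go down now, 0 if the group must be throttled. */
int tb_may_dispatch(struct tb_group *g, uint64_t io_bytes, uint64_t now_ns)
{
    tb_refill(g, now_ns);
    if (g->tokens < io_bytes)
        return 0;
    g->tokens -= io_bytes;
    return 1;
}
```

Note that the decision looks only at the group's own byte count. Nothing in it knows how much disk time the IO will cost or what other groups are doing.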
> > > > dm-ioband > > --------- > > This is a proportional bandwidth controller implemented as device mapper > > driver and provides fair access in terms of amount of IO done (not in terms > > of disk time as CFQ does). > > > > So one will setup one or more dm-ioband devices on top of physical/logical > > block device, configure the ioband device and pass information like grouping > > etc. Now this device will keep track of bios flowing through it and control > > the flow of bios based on group policies. > > > > IO scheduler based IO controller > > -------------------------------- > > Here we have viewed the problem of IO contoller as hierarchical group > > scheduling (along the lines of CFS group scheduling) issue. Currently one can > > view linux IO schedulers as flat where there is one root group and all the IO > > belongs to that group. > > > > This patchset basically modifies IO schedulers to also support hierarchical > > group scheduling. CFQ already provides fairness among different processes. I > > have extended it support group IO schduling. Also took some of the code out > > of CFQ and put in a common layer so that same group scheduling code can be > > used by noop, deadline and AS to support group scheduling. > > > > Pros/Cons > > ========= > > There are pros and cons to each of the approach. Following are some of the > > thoughts. > > > > Max bandwidth vs proportional bandwidth > > --------------------------------------- > > IO throttling is a max bandwidth controller and not a proportional one. > > Additionaly it provides fairness in terms of amount of IO done (and not in > > terms of disk time as CFQ does). > > > > Personally, I think that proportional weight controller is useful to more > > people than just max bandwidth controller. In addition, IO scheduler based > > controller can also be enhanced to do max bandwidth control. So it can > > satisfy wider set of requirements. 
> >
> > Fairness in terms of disk time vs size of IO
> > ---------------------------------------------
> > A higher level controller will most likely be limited to providing fairness
> > in terms of size/number of IO done and will find it hard to provide fairness
> > in terms of disk time used (as CFQ provides between various prio levels). This
> > is because only IO scheduler knows how much disk time a queue has used and
> > information about queues and disk time used is not exported to higher
> > layers.
> >
> > So a seeky application will still run away with lot of disk time and bring
> > down the overall throughput of the disk.
>
> But that's only true if the thing is poorly implemented.
>
> A high-level controller will need some view of the busyness of the
> underlying device(s).  That could be "proportion of idle time", or
> "average length of queue" or "average request latency" or some mix of
> these or something else altogether.
>
> But these things are simple to calculate, and are simple to feed back
> to the higher-level controller and probably don't require any changes
> to the IO scheduler at all, which is a great advantage.
>
>
> And I must say that high-level throttling based upon feedback from
> lower layers seems like a much better model to me than hacking away in
> the IO scheduler layer.  Both from an implementation point of view and
> from a "we can get it to work on things other than block devices" point
> of view.
>

Hi Andrew,

Few thoughts.

- A higher level throttling approach suffers from the issue of unfair
  throttling. So if there are multiple tasks in the group, who do we throttle
  and how do we make sure that we did throttling in proportion to the prio of
  tasks? Andrea's IO throttling implementation suffered from these issues. I
  had run some tests where RT and BW tasks were getting same BW with-in group
  or tasks of different prio were getting same BW.
Even if we figure a way out to do fair throttling with-in group, underlying IO scheduler might not be CFQ at all and we should not have done so. https://lists.linux-foundation.org/pipermail/containers/2009-May/017588.html - Higher level throttling does not know where actually IO is going in physical layer. So we might unnecessarily be throttling IO which are going to same logical device but at the end of day to different physical devices. Agreed that some people will want that behavior, especially in the case of max bandwidth control where one does not want to give you the BW because you did not pay for it. So higher level controller is good for max bw control but if it comes to optimal usage of resources and do control only if needed, then it probably is not the best thing. About the feedback thing, I am not very sure. Are you saying that we will run timed groups in higher layer and take feedback from underlying IO scheduler about how much time a group consumed or something like that and not do accounting in terms of size of IO? > > Currently dm-ioband provides fairness in terms of number/size of IO. > > > > Latencies and isolation between groups > > -------------------------------------- > > An higher level controller is generally implementing a bandwidth throttling > > solution where if a group exceeds either the max bandwidth or the proportional > > share then throttle that group. > > > > This kind of approach will probably not help in controlling latencies as it > > will depend on underlying IO scheduler. Consider following scenario. > > > > Assume there are two groups. One group is running multiple sequential readers > > and other group has a random reader. sequential readers will get a nice 100ms > > slice > > Do you refer to each reader within group1, or to all readers? It would be > daft if each reader in group1 were to get 100ms. > All readers in the group should get 100ms each, both in IO throttling and dm-ioband solution. 
Higher level solutions are not keeping track of time slices. Time slices will
be allocated by CFQ which does not have any idea about grouping. Higher
level controller just keeps track of size of IO done at group level and then
runs either a leaky bucket or token bucket algorithm.

IO throttling is a max BW controller, so it will not even care about what is
happening in other group. It will just be concerned with rate of IO in one
particular group and if we exceed specified limit, throttle it. So until and
unless sequential reader group hits its max bw limit, it will keep sending
reads down to CFQ, and CFQ will happily assign 100ms slices to readers.

dm-ioband will not try to choke the high throughput sequential reader group
for the slow random reader group because that would just kill the throughput
of rotational media. Every sequential reader will run for few ms and then
be throttled and this goes on. Disk will soon be seek bound.

> > each and then a random reader from group2 will get to dispatch the
> > request. So latency of this random reader will depend on how many sequential
> > readers are running in other group and that is a weak isolation between groups.
>
> And yet that is what you appear to mean.
>
> But surely nobody would do that - the 100ms would be assigned to and
> distributed amongst all readers in group1?

Dividing 100ms to all the sequential readers might not be very good on
rotational media as each reader runs for small time and then seek happens.
This will increase number of seeks in the system. Think of 32 sequential
readers in the group and then each getting about 3ms to run.

A better way probably is to give each queue 100ms in one run of group and
then switch group. Something like the following.

SR1 RR SR2 RR SR3 RR SR4 RR...

Now each sequential reader gets 100ms and the disk is not seek bound; at the
same time, random reader latency is limited by the number of competing groups
and not by the number of processes in the group.
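The arithmetic behind this comparison can be written down as a small sketch. It is a back-of-the-envelope model only, assuming fixed slices and ignoring seek, idle and dispatch overhead; the function names are hypothetical and do not come from CFQ or the patches.

```c
/*
 * Worst-case wait (ms) before the random reader runs again when the
 * scheduler knows nothing about groups and every sequential reader
 * gets a full slice in turn.
 */
unsigned int flat_wait_ms(unsigned int nr_seq_readers, unsigned int slice_ms)
{
    return nr_seq_readers * slice_ms;
}

/*
 * Worst-case wait (ms) when the scheduler switches group after each
 * queue's slice (SR1 RR SR2 RR ...): one slice per competing group.
 */
unsigned int grouped_wait_ms(unsigned int nr_competing_groups,
                             unsigned int slice_ms)
{
    return nr_competing_groups * slice_ms;
}

/*
 * Slice (microseconds) each reader would get if one group slice were
 * instead divided among all the readers in the group.
 */
unsigned int divided_slice_us(unsigned int nr_seq_readers,
                              unsigned int slice_ms)
{
    if (nr_seq_readers == 0)
        return 0;
    return slice_ms * 1000 / nr_seq_readers;
}
```

For 32 sequential readers and 100ms slices, the flat case lets the random reader wait up to 3200ms, the group-wise case up to 100ms with one competing group, and dividing the slice would leave each reader 3125us, roughly 3ms, before the next seek.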
This is what IO scheduler based IO controller is effectively doing currently. > > > When we control things at IO scheduler level, we assign one time slice to one > > group and then pick next entity to run. So effectively after one time slice > > (max 180ms, if prio 0 sequential reader is running), random reader in other > > group will get to run. Hence we achieve better isolation between groups as > > response time of process in a differnt group is generally not dependent on > > number of processes running in competing group. > > I don't understand why you're comparing this implementation with such > an obviously dumb competing design! > > > So a higher level solution is most likely limited to only shaping bandwidth > > without any control on latencies. > > > > Stacking group scheduler on top of CFQ can lead to issues > > --------------------------------------------------------- > > IO throttling and dm-ioband both are second level controller. That is these > > controllers are implemented in higher layers than io schedulers. So they > > control the IO at higher layer based on group policies and later IO > > schedulers take care of dispatching these bios to disk. > > > > Implementing a second level controller has the advantage of being able to > > provide bandwidth control even on logical block devices in the IO stack > > which don't have any IO schedulers attached to these. But they can also > > interefere with IO scheduling policy of underlying IO scheduler and change > > the effective behavior. Following are some of the issues which I think > > should be visible in second level controller in one form or other. > > > > Prio with-in group > > ------------------ > > A second level controller can potentially interefere with behavior of > > different prio processes with-in a group. bios are buffered at higher layer > > in single queue and release of bios is FIFO and not proportionate to the > > ioprio of the process. 
This can result in a particular prio level not > > getting fair share. > > That's an administrator error, isn't it? Should have put the > different-priority processes into different groups. > I am thinking in practice it probably will be a mix of priority in each group. For example, consider a hypothetical scenario where two students on a university server are given two cgroups of certain weights so that IO done by these students are limited in case of contention. Now these students might want to throw in a mix of priority workload in their respective cgroup. Admin would not have any idea what priority process students are running in respective cgroup. > > Buffering at higher layer can delay read requests for more than slice idle > > period of CFQ (default 8 ms). That means, it is possible that we are waiting > > for a request from the queue but it is buffered at higher layer and then idle > > timer will fire. It means that queue will losse its share at the same time > > overall throughput will be impacted as we lost those 8 ms. > > That sounds like a bug. > Actually this probably is a limitation of higher level controller. It most likely is sitting so high in IO stack that it has no idea what underlying IO scheduler is and what are IO scheduler's policies. So it can't keep up with IO scheduler's policies. Secondly, it might be a low weight group and tokens might not be available fast enough to release the request. > > Read Vs Write > > ------------- > > Writes can overwhelm readers hence second level controller FIFO release > > will run into issue here. If there is a single queue maintained then reads > > will suffer large latencies. If there separate queues for reads and writes > > then it will be hard to decide in what ratio to dispatch reads and writes as > > it is IO scheduler's decision to decide when and how much read/write to > > dispatch. 
> > This is another place where higher level controller will not be in
> > sync with lower level io scheduler and can change the effective policies of
> > underlying io scheduler.
>
> The IO schedulers already take care of read-vs-write and already take
> care of preventing large writes-starve-reads latencies (or at least,
> they're supposed to).

True. Actually this is a limitation of higher level controller. A higher
level controller will most likely implement some kind of queuing/buffering
mechanism where it will buffer requests when it decides to throttle the
group. Now once a fair number of read and write requests are buffered, and if
controller is ready to dispatch some requests from the group, which
requests/bio should it dispatch? Reads first or writes first or reads and
writes in certain ratio?

In what ratio reads and writes are dispatched is the property and decision of
IO scheduler. Now higher level controller will be taking this decision and
change the behavior of underlying io scheduler.

>
> > CFQ IO context Issues
> > ---------------------
> > Buffering at higher layer means submission of bios later with the help of
> > a worker thread.
>
> Why?
>
> If it's a read, we just block the userspace process.
>
> If it's a delayed write, the IO submission already happens in a kernel thread.

Is it ok to block pdflush on a group? Some low weight group might block it
for long time and hence not allow flushing out other pages. Probably that's
the reason pdflush used to check if underlying device is congested or not and
if it is congested, we don't go ahead with submission of request. With
per bdi flusher thread things will change.

I think btrfs also has some threads which don't want to block and if
underlying device is congested, it bails out. That's the reason I implemented
per group congestion interface where if a thread does not want to block, it
can check whether the group the IO is going into is congested and whether it
will block.
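As an illustration of what such a check might look like from the caller's side, here is a hedged sketch. The structure, field names and threshold are assumptions made up for this example; the actual interface in the patches is not shown in this message.

```c
/* Hypothetical per-group queue state. */
struct io_group_q {
    unsigned int nr_queued;        /* requests currently queued for the group */
    unsigned int congestion_limit; /* per-group threshold, assumed tunable */
};

/* Non-zero if submitting to this group now would likely block. */
int group_congested(const struct io_group_q *g)
{
    return g->nr_queued >= g->congestion_limit;
}

/*
 * A submitter that must not sleep (for example a flusher thread) checks
 * the group first and skips the IO instead of blocking.
 * Returns 1 if the request was queued, 0 if it was skipped.
 */
int try_submit_nonblocking(struct io_group_q *g)
{
    if (group_congested(g))
        return 0;
    g->nr_queued++;
    return 1;
}
```

The point of the design is that the congestion decision is made per group rather than per device, so one low weight group filling its queue does not stop a flusher from pushing pages that belong to other groups.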
So for such threads, probably higher level controller shall have to implement
per group congestion interface so that threads which don't want to block can
check with the controller whether it has sufficient BW to let it through and
not block, or maybe start buffering writes in group queue.

>
> If it's a synchronous write, we have to block the userspace caller
> anyway.
>
> Async reads might be an issue, dunno.
>

I think async IO is one of the reasons. IIRC, Andrea Righi implemented the
policy of returning error for async IO if group did not have sufficient
tokens to dispatch the async IO and expected the application to retry later.
I am not sure if that is ok.

So yes, if we are not buffering any of the read requests and either blocking
the caller or returning an error (async IO) then CFQ io context is not an
issue.

> > This changes the io context information at CFQ layer which
> > assigns the request to submitting thread. Change of io context info again
> > leads to issues of idle timer expiry and issue of a process not getting fair
> > share and reduced throughput.
>
> But we already have that problem with delayed writeback, which is a
> huge thing - often it's the majority of IO.
>

For delayed writes CFQ will not anticipate so increased anticipation timer
expiry is not an issue with writes. But it probably will be an issue with
reads where if higher level controller decides to block next read and CFQ
is anticipating on that read. I am wondering that such kind of issues must
appear with all the higher level device mapper/software raid devices also.
How do they handle it? Maybe it is more theoretical and in practice impact
is not significant.

> > Throughput with noop, deadline and AS
> > ---------------------------------------------
> > I think a higher level controller will result in reduced overall throughput
> > (as compared to io scheduler based io controller) and more seeks with noop,
> > deadline and AS.
> > > > The reason being, that it is likely that IO with-in a group will be related > > and will be relatively close as compared to IO across the groups. For example, > > thread pool of kvm-qemu doing IO for virtual machine. In case of higher level > > control, IO from various groups will go into a single queue at lower level > > controller and it might happen that IO is now interleaved (G1, G2, G1, G3, > > G4....) causing more seeks and reduced throughput. (Agreed that merging will > > help up to some extent but still....). > > > > Instead, in case of lower level controller, IO scheduler maintains one queue > > per group hence there is no interleaving of IO between groups. And if IO is > > related with-in group, then we shoud get reduced number/amount of seek and > > higher throughput. > > > > Latency can be a concern but that can be controlled by reducing the time > > slice length of the queue. > > Well maybe, maybe not. If a group is throttled, it isn't submitting > new IO. The unthrottled group is doing the IO submitting and that IO > will have decent locality. But throttling will kick in ocassionaly. Rest of the time both the groups will be dispatching bios at the same time. So for most part of it IO scheduler will probably see IO from both the groups and there will be small intervals where one group is completely throttled and IO scheduler is busy dispatching requests only from a single group. > > > Fairness at logical device level vs at physical device level > > ------------------------------------------------------------ > > > > IO scheduler based controller has the limitation that it works only with the > > bottom most devices in the IO stack where IO scheduler is attached. > > > > For example, assume a user has created a logical device lv0 using three > > underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2 > > in two groups doing IO on lv0. 
> > Also assume that weights of groups are in the
> > ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.
> >
> >                T1    T2
> >                  \   /
> >                   lv0
> >                 /  |  \
> >              sda  sdb  sdc
> >
> >
> > Now resource control will take place only on devices sda, sdb and sdc and
> > not at lv0 level. So if IO from two tasks is relatively uniformly
> > distributed across the disks then T1 and T2 will see the throughput ratio
> > in proportion to weight specified. But if IO from T1 and T2 is going to
> > different disks and there is no contention then at higher level they both
> > will see same BW.
> >
> > Here a second level controller can produce better fairness numbers at
> > logical device but most likely at reduced overall throughput of the system,
> > because it will try to control IO even if there is no contention at the
> > physical level, possibly leaving disks unused in the system.
> >
> > Hence, the question is how important it is to control bandwidth at
> > higher level logical devices also. The actual contention for resources is
> > at the leaf block device so it probably makes sense to do any kind of
> > control there and not at the intermediate devices. Secondly probably it
> > also means better use of available resources.
>
> hm. What will be the effects of this limitation in real-world use?

In some cases the user/application will not see the bandwidth ratio between two
groups in the same proportion as the assigned weights, and the primary reason
will be that the workload did not create enough contention for the physical
resources underneath.

So it all depends on what kind of bandwidth guarantees we are offering. If we
are saying that we provide good fairness numbers at logical devices
irrespective of whether resources are used optimally, then it will be
irritating for the user.

I think it also might become an issue once we implement max bandwidth control.
We will not be able to define max bandwidth on a logical device, and an
application will get more than max bandwidth if it is doing IO to different
underlying devices.

I would say that leaf node control is good for optimal resource usage and for
proportional BW control, but not a good fit for max bandwidth control.

>
> > Limited Fairness
> > ----------------
> > Currently CFQ idles on a sequential reader queue to make sure it gets its
> > fair share. A second level controller will find it tricky to anticipate.
> > Either it will not have any anticipation logic and in that case it will not
> > provide fairness to single readers in a group (as dm-ioband does) or if it
> > starts anticipating then we should run into these strange situations where
> > second level controller is anticipating on one queue/group and underlying
> > IO scheduler might be anticipating on something else.
>
> It depends on the size of the inter-group timeslices. If the amount of
> time for which a group is unthrottled is "large" compared to the
> typical anticipation times, this issue fades away.
>
> And those timeslices _should_ be large. Because as you mentioned
> above, different groups are probably working different parts of the
> disk.
>
> > Need of device mapper tools
> > ---------------------------
> > A device mapper based solution will require creation of an ioband device
> > on each physical/logical device one wants to control. So it requires usage
> > of device mapper tools even for the people who are not using device mapper.
> > At the same time creation of ioband device on each partition in the system to
> > control the IO can be cumbersome and overwhelming if system has got lots of
> > disks and partitions within.
> >
> >
> > IMHO, IO scheduler based IO controller is a reasonable approach to solve the
> > problem of group bandwidth control, and can do hierarchical IO scheduling
> > more tightly and efficiently.
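The lv0 example discussed above (T1 and T2 with weights 2:1 on a logical device built from sda, sdb and sdc) can be made concrete with a rough sketch. Everything here is an assumption for illustration: the function name `leaf_level_share`, the 60 MB/s per-disk figure, and the simplification that a backlogged group gets exactly its weighted share of each disk it keeps busy. It is not how any of the posted patches computes shares.

```python
def leaf_level_share(disk_bw, demand, weights):
    """Split each disk's bandwidth among the groups backlogged on that disk.

    disk_bw: {disk: bandwidth}, hypothetical numbers
    demand:  {group: set of disks the group keeps busy}
    weights: {group: weight}
    Control happens per leaf disk only; nothing is enforced at lv0.
    """
    totals = {group: 0.0 for group in weights}
    for disk, bw in disk_bw.items():
        active = [g for g in weights if disk in demand[g]]
        weight_sum = sum(weights[g] for g in active)
        for g in active:
            totals[g] += bw * weights[g] / weight_sum
    return totals


disks = {"sda": 60.0, "sdb": 60.0, "sdc": 60.0}   # assumed MB/s per disk
weights = {"T1": 2, "T2": 1}

# Case A: both tasks spread IO uniformly over all three disks.
uniform = {"T1": {"sda", "sdb", "sdc"}, "T2": {"sda", "sdb", "sdc"}}
# Case B: T1 only touches sda, T2 only touches sdb, so there is no contention.
disjoint = {"T1": {"sda"}, "T2": {"sdb"}}

print("uniform :", leaf_level_share(disks, uniform, weights))
print("disjoint:", leaf_level_share(disks, disjoint, weights))
```

In case A the totals follow the 2:1 weights; in case B both tasks see the same bandwidth at lv0 because the leaf disks never have to arbitrate, which is the limitation described above.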
> >
> > But I am all ears to alternative approaches and suggestions on how things
> > can be done better and will be glad to implement them.
> >
> > TODO
> > ====
> > - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
> > - More testing to make sure there are no regressions in CFQ.
> >
> > Testing
> > =======
> >
> > Environment
> > ===========
> > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem.
>
> That's a bit of a toy.

Yes it is. :-)

>
> Do we have testing results for more enterprisey hardware? Big storage
> arrays? SSD? Infiniband? iscsi? nfs? (lol, gotcha)

Not yet. I will try to get hold of some storage arrays and run some tests.

>
> > I am mostly
> > running fio jobs which have been limited to 30 seconds run and then monitored
> > the throughput and latency.
> >
> > Test1: Random Reader Vs Random Writers
> > ======================================
> > Launched a random reader and then increasing number of random writers to see
> > the effect on random reader BW and max latencies.
> >
> > [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> >
> > [Vanilla CFQ, No groups]
> > <--------------random writers-------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec
> > 2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec
> > 4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec
> > 8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec
> > 16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec
> > 32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec
> >
> > Created two cgroups group1 and group2 of weights 500 each. Launched increasing
> > number of random writers in group1 and one random reader in group2 using fio.
> >
> > [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> > <--------------random writers(group1)-------------> <-random reader(group2)->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec
> > 2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec
> > 4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec
> > 8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec
> > 16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec
> > 32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec
>
> That's a good result.
>
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> >
> > [IO controller CFQ; No groups ]
> > <--------------random writers-------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec
> > 2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec
> > 4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec
> > 8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec
> > 16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec
> > 32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec
> >
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader. Bring down
> >   its throughput and bump up latencies significantly.
>
> Isn't that a CFQ shortcoming which we should address separately? If
> so, the comparisons aren't presently valid because we're comparing with
> a CFQ which has known, should-be-fixed problems.

I am not sure if it is a CFQ issue. These are synchronous random writes. They
are just as important as the random reader. So now CFQ has 33 synchronous
queues to serve.
Because it does not know about groups, it has no choice but to serve them in
a round robin manner. So it does not sound like a CFQ issue. It is just that
CFQ can give the random reader an advantage if it knows that the random reader
is in a different group, and that's where the IO controller comes into the
picture.

>
> > - With IO controller, one can provide isolation to the random reader group and
> >   maintain consistent view of bandwidth and latencies.
> >
> > Test2: Random Reader Vs Sequential Reader
> > ========================================
> > Launched a random reader and then increasing number of sequential readers to
> > see the effect on BW and latencies of random reader.
> >
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ]
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]
> >
> > [ Vanilla CFQ, No groups ]
> > <---------------seq readers---------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec
> > 2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec
> > 4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec
> > 8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec
> > 16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec
> >
> > Created two cgroups group1 and group2 of weights 500 each. Launched increasing
> > number of sequential readers in group1 and one random reader in group2 using
> > fio.
> >
> > [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
> > <---------------group1--------------------------->  <------group2--------->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec
> > 2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec
> > 4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec
> > 8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec
> > 16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec
> >
> > Also ran the same test with IO controller CFQ in flat mode to see if there
> > are any major deviations from Vanilla CFQ. Does not look like any.
> >
> > [IO controller CFQ; No groups ]
> > <---------------seq readers---------------------->  <------random reader-->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec
> > 2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec
> > 4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec
> > 8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec
> > 16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec
> >
> > Notes:
> > - The BW and latencies of random reader in group 2 seem to be stable and
> >   bounded and do not get impacted much as number of sequential readers
> >   increases in group1. Hence providing good isolation.
> >
> > - Throughput of sequential readers comes down and latencies go up as half
> >   of disk bandwidth (in terms of time) has been reserved for random reader
> >   group.
> >
> > Test3: Sequential Reader Vs Sequential Reader
> > ============================================
> > Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> > Launched increasing number of sequential readers in group1 and one sequential
> > reader in group2 using fio and monitored how bandwidth is being distributed
> > between two groups.
> >
> > First 5 columns give stats about job in group1 and last two columns give
> > stats about job in group2.
> >
> > <---------------group1--------------------------->  <------group2--------->
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
> > 1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec
> > 2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec
> > 4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec
> > 8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec
> > 16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec
> >
> > Note: group2 is getting double the bandwidth of group1 even in the face
> > of increasing number of readers in group1.
> >
> > Test4 (Isolation between two KVM virtual machines)
> > ==================================================
> > Created two KVM virtual machines. Partitioned a disk on host in two partitions
> > and gave one partition to each virtual machine. Put the two virtual machines
> > in two different cgroups of weight 1000 and 500 respectively. Virtual machines
> > created ext3 file system on the partitions exported from host and did buffered
> > writes. Host sees writes as synchronous and virtual machine with higher weight
> > gets double the disk time of virtual machine of lower weight. Used deadline
> > scheduler in this test case.
> >
> > Some more details about configuration are in documentation patch.
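As a quick check on the Test3 note above, the aggregate bandwidth columns of that table can be turned into group2/group1 ratios and compared with the 1000:500 weight ratio. The numbers below are copied from the table; the helper name `measured_ratios` is made up for this sketch. The controller divides disk time rather than bytes, so a bandwidth ratio near 2 is only expected here because both groups run similar sequential readers.

```python
# (number of group1 readers, group1 aggregate KiB/s, group2 aggregate KiB/s),
# copied from the Test3 table.
test3_rows = [
    (1, 8970, 20681),
    (2, 9984, 19682),
    (4, 9280, 19235),
    (8, 9926, 19501),
    (16, 9395, 19367),
]

# group2 weight / group1 weight
expected_ratio = 1000 / 500


def measured_ratios(rows):
    """Return (nr, group2 bandwidth / group1 bandwidth) for each table row."""
    return [(nr, g2 / g1) for nr, g1, g2 in rows]


for nr, ratio in measured_ratios(test3_rows):
    print(f"nr={nr:2d}  group2/group1 = {ratio:.2f}  (weights give {expected_ratio:.1f})")
```

The ratios stay within roughly 0.3 of the weight ratio across all five rows, which is the behaviour the note describes.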
> >
> > Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> > ===================================================================
> > Fairness for async writes is tricky and biggest reason is that async writes
> > are cached in higher layers (page cache) as well as possibly in file system
> > layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> > in proportional manner.
> >
> > For example, consider two dd threads reading /dev/zero as input file and doing
> > writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> > be forced to write out some pages to disk before more pages can be dirtied. But
> > not necessarily dirty pages of same thread are picked. It can very well pick
> > the inode of lesser priority dd thread and do some writeout. So effectively
> > higher weight dd is doing writeouts of lower weight dd pages and we don't see
> > service differentiation.
> >
> > IOW, the core problem with buffered write fairness is that higher weight thread
> > does not throw enough IO traffic at IO controller to keep the queue
> > continuously backlogged. In my testing, there are many .2 to .8 second
> > intervals where higher weight queue is empty and in that duration lower weight
> > queue gets lots of work done giving the impression that there was no service
> > differentiation.
> >
> > In summary, from IO controller point of view async writes support is there.
> > Because page cache has not been designed in such a manner that higher
> > prio/weight writer can do more write out as compared to lower prio/weight
> > writer, getting service differentiation is hard and it is visible in some
> > cases and not visible in some cases.
>
> Here's where it all falls to pieces.
>
> For async writeback we just don't care about IO priorities. Because
> from the point of view of the userspace task, the write was async! It
> occurred at memory bandwidth speed.
>
> It's only when the kernel's dirty memory thresholds start to get
> exceeded that we start to care about prioritisation. And at that time,
> all dirty memory (within a memcg?) is equal - a high-ioprio dirty page
> consumes just as much memory as a low-ioprio dirty page.
>
> So when balance_dirty_pages() hits, what do we want to do?
>
> I suppose that all we can do is to block low-ioprio processes more
> aggressively at the VFS layer, to reduce the rate at which they're
> dirtying memory so as to give high-ioprio processes more of the disk
> bandwidth.
>
> But you've gone and implemented all of this stuff at the io-controller
> level and not at the VFS level so you're, umm, screwed.

True, that's an issue. For async writes we don't create parallel IO paths from
user space to the IO scheduler, hence it is hard to provide fairness in all
cases. I think part of the problem is the page cache, and some serialization
also comes from kjournald.

How about coming up with another cgroup controller for buffered writes, or
clubbing it with the memory controller as KAMEZAWA Hiroyuki suggested, and
co-mounting this with the io controller? This should help control buffered
writes per cgroup.

>
> Importantly screwed! It's a very common workload pattern, and one
> which causes tremendous amounts of IO to be generated very quickly,
> traditionally causing bad latency effects all over the place. And we
> have no answer to this.
>
> > Vanilla CFQ Vs IO Controller CFQ
> > ================================
> > We have not fundamentally changed CFQ, instead enhanced it to also support
> > hierarchical io scheduling. In the process invariably there are small changes
> > here and there as new scenarios come up. Running some tests here and comparing
> > both the CFQ's to see if there is any major deviation in behavior.
> >
> > Test1: Sequential Readers
> > =========================
> > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec
> > 2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec
> > 4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec
> > 8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec
> > 16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec
> >
> > IO scheduler: IO controller CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec
> > 2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec
> > 4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec
> > 8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec
> > 16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec
> >
> > Test2: Sequential Writers
> > =========================
> > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec
> > 2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec
> > 4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec
> > 8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec
> > 16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec
> > 2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec
> > 4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec
> > 8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec
> > 16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec
> >
> > Test3: Random Readers
> > =====================
> > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   484KiB/s    484KiB/s    484KiB/s    22596 usec
> > 2   229KiB/s    196KiB/s    425KiB/s    51111 usec
> > 4   119KiB/s    73KiB/s     405KiB/s    2344 msec
> > 8   93KiB/s     23KiB/s     399KiB/s    2246 msec
> > 16  38KiB/s     8KiB/s      328KiB/s    3965 msec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   483KiB/s    483KiB/s    483KiB/s    29391 usec
> > 2   229KiB/s    196KiB/s    426KiB/s    51625 usec
> > 4   132KiB/s    88KiB/s     417KiB/s    2313 msec
> > 8   79KiB/s     18KiB/s     389KiB/s    2298 msec
> > 16  43KiB/s     9KiB/s      327KiB/s    3905 msec
> >
> > Test4: Random Writers
> > =====================
> > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ]
> >
> > IO scheduler: Vanilla CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec
> > 2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec
> > 4   2657KiB/s   265KiB/s    6025KiB/s   216K usec
> > 8   951KiB/s    122KiB/s    3386KiB/s   1148K usec
> > 16  66KiB/s     22KiB/s     829KiB/s    1308 msec
> >
> > IO scheduler: IO Controller CFQ
> >
> > nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
> > 1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec
> > 2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec
> > 4   3113KiB/s   334KiB/s    5782KiB/s   200K usec
> > 8   1146KiB/s   95KiB/s     3832KiB/s   593K usec
> > 16  71KiB/s     29KiB/s     814KiB/s    1457 msec
> >
> > Notes:
> > - It does not look like anything has changed significantly.
> >
> > Previous versions of the patches were posted here.
> > ------------------------------------------------
> >
> > (V1) http://lkml.org/lkml/2009/3/11/486
> > (V2) http://lkml.org/lkml/2009/5/5/275
> > (V3) http://lkml.org/lkml/2009/5/26/472
> > (V4) http://lkml.org/lkml/2009/6/8/580
> > (V5) http://lkml.org/lkml/2009/6/19/279
> > (V6) http://lkml.org/lkml/2009/7/2/369
> > (V7) http://lkml.org/lkml/2009/7/24/253
> > (V8) http://lkml.org/lkml/2009/8/16/204
> > (V9) http://lkml.org/lkml/2009/8/28/327
> >
> > Thanks
> > Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
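One way to put a number on "nothing has changed significantly" is to compare the aggregate bandwidth columns of the Vanilla CFQ and IO Controller CFQ tables. The sketch below does that for the sequential reader and sequential writer runs, using values copied from the tables above; the helper name `relative_change_pct` is invented for this example, and single 30 second fio runs carry noise that this simple comparison does not account for.

```python
# Aggregate bandwidth in KiB/s for numjobs = 1, 2, 4, 8, 16,
# copied from the Test1 and Test2 tables.
seq_read = {
    "vanilla": [35499, 30690, 29411, 29312, 28921],
    "ioc": [34494, 30616, 29631, 29162, 28945],
}
seq_write = {
    "vanilla": [22669, 22179, 23174, 22427, 20601],
    "ioc": [22911, 23383, 23207, 22566, 21349],
}


def relative_change_pct(vanilla, ioc):
    """Percentage change of IO controller CFQ relative to vanilla CFQ, per row."""
    return [100.0 * (new - old) / old for old, new in zip(vanilla, ioc)]


for name, data in (("seq read", seq_read), ("seq write", seq_write)):
    changes = relative_change_pct(data["vanilla"], data["ioc"])
    worst = max(abs(c) for c in changes)
    print(name, [f"{c:+.2f}%" for c in changes], f"largest deviation {worst:.2f}%")
```

On these numbers the sequential reader aggregates differ by less than 3% and the sequential writer aggregates by less than 6%.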
* Re: IO scheduler based IO controller V10
       [not found] ` <20090925050429.GB12555-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2009-09-25  9:07 ` Ryo Tsuruta
  0 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-09-25 9:07 UTC (permalink / raw)
To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA,
    jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA,
    balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA,
    jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg,
    jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
    fchecconi-Re5JQEeQqe8AvxtiuMwx3w, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
    torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek,

Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> Higher level solutions are not keeping track of time slices. Time slices will
> be allocated by CFQ which does not have any idea about grouping. Higher
> level controller just keeps track of size of IO done at group level and
> then runs either a leaky bucket or token bucket algorithm.
>
> IO throttling is a max BW controller, so it will not even care about what is
> happening in other group. It will just be concerned with rate of IO in one
> particular group and if we exceed specified limit, throttle it. So until and
> unless sequential reader group hits its max bw limit, it will keep sending
> reads down to CFQ, and CFQ will happily assign 100ms slices to readers.
>
> dm-ioband will not try to choke the high throughput sequential reader group
> for the slow random reader group because that would just kill the throughput
> of rotational media. Every sequential reader will run for few ms and then
> be throttled and this goes on. Disk will soon be seek bound.

Because dm-ioband provides fairness in terms of how many IO requests are
issued or how many bytes are transferred, this behaviour is to be expected.
Do you think fairness in terms of IO requests and size is not fair?

> > > Buffering at higher layer can delay read requests for more than slice idle
> > > period of CFQ (default 8 ms). That means, it is possible that we are waiting
> > > for a request from the queue but it is buffered at higher layer and then idle
> > > timer will fire. It means that queue will lose its share at the same time
> > > overall throughput will be impacted as we lost those 8 ms.
> >
> > That sounds like a bug.
> >
>
> Actually this probably is a limitation of higher level controller. It most
> likely is sitting so high in IO stack that it has no idea what underlying
> IO scheduler is and what are IO scheduler's policies. So it can't keep up
> with IO scheduler's policies. Secondly, it might be a low weight group and
> tokens might not be available fast enough to release the request.
>
> > > Read Vs Write
> > > -------------
> > > Writes can overwhelm readers hence second level controller FIFO release
> > > will run into issue here. If there is a single queue maintained then reads
> > > will suffer large latencies. If there are separate queues for reads and writes
> > > then it will be hard to decide in what ratio to dispatch reads and writes as
> > > it is IO scheduler's decision to decide when and how much read/write to
> > > dispatch. This is another place where higher level controller will not be in
> > > sync with lower level io scheduler and can change the effective policies of
> > > underlying io scheduler.
> >
> > The IO schedulers already take care of read-vs-write and already take
> > care of preventing large writes-starve-reads latencies (or at least,
> > they're supposed to).
>
> True. Actually this is a limitation of higher level controller. A higher
> level controller will most likely implement some kind of queuing/buffering
> mechanism where it will buffer requests when it decides to throttle the
> group. Now once a fair number of read and write requests are buffered, and if
> controller is ready to dispatch some requests from the group, which
> requests/bio should it dispatch? reads first or writes first or reads and
> writes in certain ratio?

The write-starve-reads on dm-ioband that you pointed out before was not
caused by FIFO release; it was caused by IO flow control in dm-ioband. When I
turned off the flow control, the read throughput improved considerably.
Now I'm considering separating dm-ioband's internal queue into sync and async
and giving a certain priority of dispatch to async IOs.

Thanks,
Ryo Tsuruta

^ permalink raw reply [flat|nested] 349+ messages in thread
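The quoted description of a higher level controller ("keeps track of size of IO done at group level and then runs either a leaky bucket or token bucket algorithm") can be sketched as follows. This is a generic byte-based token bucket written for illustration only; it is not dm-ioband's or the io-throttle patch's code, and the class name, parameters and explicit `now` argument are all assumptions.

```python
class TokenBucket:
    """Byte-based token bucket for one group (illustrative sketch only)."""

    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec   # refill rate
        self.burst = burst_bytes         # bucket capacity
        self.tokens = burst_bytes        # start with a full bucket
        self.last = 0.0                  # time of the last refill, in seconds

    def _refill(self, now):
        # Add tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_dispatch(self, now, nbytes):
        """Return True if an IO of nbytes may be dispatched at time `now`."""
        self._refill(now)
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        # The caller would have to buffer the IO, block, or return an error.
        return False


bucket = TokenBucket(rate_bytes_per_sec=1000, burst_bytes=500)
print(bucket.try_dispatch(0.0, 500))   # bucket starts full
print(bucket.try_dispatch(0.0, 100))   # no tokens left yet
print(bucket.try_dispatch(0.5, 400))   # refilled after half a second
```

Note that the bucket only counts bytes. It has no notion of seek cost, time slices, or which queue the underlying IO scheduler is idling on, which is the mismatch the thread is arguing about.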
* Re: IO scheduler based IO controller V10 2009-09-25 5:04 ` Vivek Goyal @ 2009-09-25 9:07 ` Ryo Tsuruta -1 siblings, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-09-25 9:07 UTC (permalink / raw) To: vgoyal Cc: akpm, linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo, riel Hi Vivek, Vivek Goyal <vgoyal@redhat.com> wrote: > Higher level solutions are not keeping track of time slices. Time slices will > be allocated by CFQ which does not have any idea about grouping. Higher > level controller just keeps track of size of IO done at group level and > then run either a leaky bucket or token bucket algorithm. > > IO throttling is a max BW controller, so it will not even care about what is > happening in other group. It will just be concerned with rate of IO in one > particular group and if we exceed specified limit, throttle it. So until and > unless sequential reader group hits it max bw limit, it will keep sending > reads down to CFQ, and CFQ will happily assign 100ms slices to readers. > > dm-ioband will not try to choke the high throughput sequential reader group > for the slow random reader group because that would just kill the throughput > of rotational media. Every sequential reader will run for few ms and then > be throttled and this goes on. Disk will soon be seek bound. Because dm-ioband provides faireness in terms of how many IO requests are issued or how many bytes are transferred, so this behaviour is to be expected. Do you think fairness in terms of IO requests and size is not fair? > > > Buffering at higher layer can delay read requests for more than slice idle > > > period of CFQ (default 8 ms). That means, it is possible that we are waiting > > > for a request from the queue but it is buffered at higher layer and then idle > > > timer will fire. 
It means that queue will losse its share at the same time > > > overall throughput will be impacted as we lost those 8 ms. > > > > That sounds like a bug. > > > > Actually this probably is a limitation of higher level controller. It most > likely is sitting so high in IO stack that it has no idea what underlying > IO scheduler is and what are IO scheduler's policies. So it can't keep up > with IO scheduler's policies. Secondly, it might be a low weight group and > tokens might not be available fast enough to release the request. > > > > Read Vs Write > > > ------------- > > > Writes can overwhelm readers hence second level controller FIFO release > > > will run into issue here. If there is a single queue maintained then reads > > > will suffer large latencies. If there separate queues for reads and writes > > > then it will be hard to decide in what ratio to dispatch reads and writes as > > > it is IO scheduler's decision to decide when and how much read/write to > > > dispatch. This is another place where higher level controller will not be in > > > sync with lower level io scheduler and can change the effective policies of > > > underlying io scheduler. > > > > The IO schedulers already take care of read-vs-write and already take > > care of preventing large writes-starve-reads latencies (or at least, > > they're supposed to). > > True. Actually this is a limitation of higher level controller. A higher > level controller will most likely implement some of kind of queuing/buffering > mechanism where it will buffer requeuests when it decides to throttle the > group. Now once a fair number read and requests are buffered, and if > controller is ready to dispatch some requests from the group, which > requests/bio should it dispatch? reads first or writes first or reads and > writes in certain ratio? The write-starve-reads on dm-ioband, that you pointed out before, was not caused by FIFO release, it was caused by IO flow control in dm-ioband. 
When I turned off the flow control, then the read throughput was quite improved. Now I'm considering separating dm-ioband's internal queue into sync and async and giving a certain priority of dispatch to async IOs. Thanks, Ryo Tsuruta ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 @ 2009-09-25 9:07 ` Ryo Tsuruta 0 siblings, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-09-25 9:07 UTC (permalink / raw) To: vgoyal Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir, paolo.valente, jmarchan, guijianfeng, fernando, mikew, jmoyer, nauman, mingo, m-ikeda, riel, lizf, fchecconi, akpm, containers, linux-kernel, s-uchida, righi.andrea, torvalds Hi Vivek, Vivek Goyal <vgoyal@redhat.com> wrote: > Higher level solutions are not keeping track of time slices. Time slices will > be allocated by CFQ which does not have any idea about grouping. Higher > level controller just keeps track of size of IO done at group level and > then run either a leaky bucket or token bucket algorithm. > > IO throttling is a max BW controller, so it will not even care about what is > happening in other group. It will just be concerned with rate of IO in one > particular group and if we exceed specified limit, throttle it. So until and > unless sequential reader group hits it max bw limit, it will keep sending > reads down to CFQ, and CFQ will happily assign 100ms slices to readers. > > dm-ioband will not try to choke the high throughput sequential reader group > for the slow random reader group because that would just kill the throughput > of rotational media. Every sequential reader will run for few ms and then > be throttled and this goes on. Disk will soon be seek bound. Because dm-ioband provides faireness in terms of how many IO requests are issued or how many bytes are transferred, so this behaviour is to be expected. Do you think fairness in terms of IO requests and size is not fair? > > > Buffering at higher layer can delay read requests for more than slice idle > > > period of CFQ (default 8 ms). That means, it is possible that we are waiting > > > for a request from the queue but it is buffered at higher layer and then idle > > > timer will fire. 
It means that queue will losse its share at the same time > > > overall throughput will be impacted as we lost those 8 ms. > > > > That sounds like a bug. > > > > Actually this probably is a limitation of higher level controller. It most > likely is sitting so high in IO stack that it has no idea what underlying > IO scheduler is and what are IO scheduler's policies. So it can't keep up > with IO scheduler's policies. Secondly, it might be a low weight group and > tokens might not be available fast enough to release the request. > > > > Read Vs Write > > > ------------- > > > Writes can overwhelm readers hence second level controller FIFO release > > > will run into issue here. If there is a single queue maintained then reads > > > will suffer large latencies. If there separate queues for reads and writes > > > then it will be hard to decide in what ratio to dispatch reads and writes as > > > it is IO scheduler's decision to decide when and how much read/write to > > > dispatch. This is another place where higher level controller will not be in > > > sync with lower level io scheduler and can change the effective policies of > > > underlying io scheduler. > > > > The IO schedulers already take care of read-vs-write and already take > > care of preventing large writes-starve-reads latencies (or at least, > > they're supposed to). > > True. Actually this is a limitation of higher level controller. A higher > level controller will most likely implement some of kind of queuing/buffering > mechanism where it will buffer requeuests when it decides to throttle the > group. Now once a fair number read and requests are buffered, and if > controller is ready to dispatch some requests from the group, which > requests/bio should it dispatch? reads first or writes first or reads and > writes in certain ratio? The write-starve-reads on dm-ioband, that you pointed out before, was not caused by FIFO release, it was caused by IO flow control in dm-ioband. 
When I turned off the flow control, the read throughput improved considerably. Now I'm considering separating dm-ioband's internal queue into sync and async queues and giving a certain dispatch priority to async IOs. Thanks, Ryo Tsuruta
* Re: IO scheduler based IO controller V10 2009-09-25 9:07 ` Ryo Tsuruta @ 2009-09-25 14:33 ` Vivek Goyal -1 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-25 14:33 UTC To: Ryo Tsuruta Cc: akpm, linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo, riel On Fri, Sep 25, 2009 at 06:07:24PM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > Vivek Goyal <vgoyal@redhat.com> wrote: > > Higher level solutions are not keeping track of time slices. Time slices will > > be allocated by CFQ which does not have any idea about grouping. Higher > > level controller just keeps track of size of IO done at group level and > > then runs either a leaky bucket or token bucket algorithm. > > > > IO throttling is a max BW controller, so it will not even care about what is > > happening in other group. It will just be concerned with rate of IO in one > > particular group and if we exceed specified limit, throttle it. So until and > > unless sequential reader group hits its max bw limit, it will keep sending > > reads down to CFQ, and CFQ will happily assign 100ms slices to readers. > > > > dm-ioband will not try to choke the high throughput sequential reader group > > for the slow random reader group because that would just kill the throughput > > of rotational media. Every sequential reader will run for a few ms and then > > be throttled and this goes on. Disk will soon be seek bound. > > Because dm-ioband provides fairness in terms of how many IO requests > are issued or how many bytes are transferred, this behaviour is to > be expected. Do you think fairness in terms of IO requests and size is > not fair? > Hi Ryo, Fairness in terms of size of IO or number of requests is probably not the best thing to do on rotational media where seek latencies are significant.
It should probably work just fine on media with very low seek latencies, like SSDs. So on rotational media, either you will not provide fairness to random readers because they are too slow, or you will choke the sequential readers in the other group and also bring down the overall disk throughput. If you decide not to choke/throttle the sequential reader group for the sake of the random reader in the other group, then you will not have good control over random reader latencies, because now the IO scheduler sees the IO from both the sequential readers and the random reader, and the sequential readers have not been throttled. So the dispatch pattern/time slices will again look like.. SR1 SR2 SR3 SR4 SR5 RR..... instead of SR1 RR SR2 RR SR3 RR SR4 RR .... SR --> sequential reader, RR --> random reader > > > > Buffering at higher layer can delay read requests for more than slice idle > > > > period of CFQ (default 8 ms). That means, it is possible that we are waiting > > > > for a request from the queue but it is buffered at higher layer and then idle > > > > timer will fire. It means that queue will lose its share; at the same time > > > > overall throughput will be impacted as we lost those 8 ms. > > > > > > That sounds like a bug. > > > > > > > Actually this probably is a limitation of higher level controller. It most > > likely is sitting so high in IO stack that it has no idea what underlying > > IO scheduler is and what are IO scheduler's policies. So it can't keep up > > with IO scheduler's policies. Secondly, it might be a low weight group and > > tokens might not be available fast enough to release the request. > > > > > > Read Vs Write > > > > ------------- > > > > Writes can overwhelm readers hence second level controller FIFO release > > > > will run into issue here. If there is a single queue maintained then reads > > > > will suffer large latencies.
If there are separate queues for reads and writes > > > > then it will be hard to decide in what ratio to dispatch reads and writes as > > > > it is IO scheduler's decision to decide when and how much read/write to > > > > dispatch. This is another place where higher level controller will not be in > > > > sync with lower level io scheduler and can change the effective policies of > > > > underlying io scheduler. > > > > > > The IO schedulers already take care of read-vs-write and already take > > > care of preventing large writes-starve-reads latencies (or at least, > > > they're supposed to). > > > > True. Actually this is a limitation of higher level controller. A higher > > level controller will most likely implement some kind of queuing/buffering > > mechanism where it will buffer requests when it decides to throttle the > > group. Now once a fair number of read and write requests are buffered, and if > > controller is ready to dispatch some requests from the group, which > > requests/bio should it dispatch? Reads first or writes first or reads and > > writes in certain ratio? > > The write-starve-reads problem on dm-ioband that you pointed out before was > not caused by FIFO release; it was caused by IO flow control in > dm-ioband. When I turned off the flow control, the read > throughput improved considerably. What was flow control doing? > > Now I'm considering separating dm-ioband's internal queue into sync > and async queues and giving a certain dispatch priority to async IOs. Even if you maintain separate queues for sync and async, in what ratio will you dispatch reads and writes to the underlying layer once fresh tokens become available to the group and you decide to unthrottle the group? Whatever policy you adopt for read and write dispatch, it might not match the policy of the underlying IO scheduler, because every IO scheduler seems to have its own way of determining how reads and writes should be dispatched.
Now somebody might start complaining that their job inside the group is not getting the same reader/writer ratio as it was getting outside the group. Thanks Vivek
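To make the "token bucket at group level" description above concrete, here is a minimal sketch in C. It is not dm-ioband or io-throttle code; `struct group_bucket`, `bucket_refill()` and `bucket_try_dispatch()` are hypothetical names, and the byte-per-millisecond accounting is an assumption chosen only for illustration. The point it shows is that such a controller looks only at bytes and elapsed time, with no knowledge of CFQ time slices or of whether the reader is sequential or random.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-group token bucket; all names are illustrative only. */
struct group_bucket {
    uint64_t tokens;       /* bytes the group may still dispatch */
    uint64_t capacity;     /* burst ceiling in bytes */
    uint64_t rate_per_ms;  /* refill rate in bytes per millisecond */
    uint64_t last_ms;      /* time of the last refill */
};

/* Add the tokens earned since the last refill, capped at capacity. */
void bucket_refill(struct group_bucket *b, uint64_t now_ms)
{
    assert(now_ms >= b->last_ms);
    uint64_t earned = (now_ms - b->last_ms) * b->rate_per_ms;
    uint64_t total = b->tokens + earned;

    b->tokens = total > b->capacity ? b->capacity : total;
    b->last_ms = now_ms;
}

/*
 * Return 1 and charge the bucket if a request of `bytes` may be sent
 * down now; return 0 if it has to stay buffered in the controller.
 */
int bucket_try_dispatch(struct group_bucket *b, uint64_t bytes,
                        uint64_t now_ms)
{
    bucket_refill(b, now_ms);
    if (b->tokens < bytes)
        return 0;
    b->tokens -= bytes;
    return 1;
}
```

Under these assumed numbers, a group refilling at 1024 bytes per millisecond with an empty bucket needs 64 ms to earn enough tokens for one 64 KiB read, which is far longer than the 8 ms slice idle period mentioned above, so the idle timer would fire while the request is still buffered.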
* Re: IO scheduler based IO controller V10 2009-09-25 14:33 ` Vivek Goyal (?) (?) @ 2009-09-28 7:30 ` Ryo Tsuruta -1 siblings, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-09-28 7:30 UTC To: vgoyal Cc: akpm, linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo, riel Hi Vivek, Vivek Goyal <vgoyal@redhat.com> wrote: > > Because dm-ioband provides fairness in terms of how many IO requests > > are issued or how many bytes are transferred, this behaviour is to > > be expected. Do you think fairness in terms of IO requests and size is > > not fair? > > > > Hi Ryo, > > Fairness in terms of size of IO or number of requests is probably not the > best thing to do on rotational media where seek latencies are significant. > > It should probably work just fine on media with very low seek latencies, > like SSDs. > > So on rotational media, either you will not provide fairness to random > readers because they are too slow, or you will choke the sequential readers > in the other group and also bring down the overall disk throughput. > > If you decide not to choke/throttle the sequential reader group for the sake > of the random reader in the other group, then you will not have good control > over random reader latencies, because now the IO scheduler sees the IO from both > the sequential readers and the random reader, and the sequential readers have not > been throttled. So the dispatch pattern/time slices will again look like.. > > SR1 SR2 SR3 SR4 SR5 RR..... > > instead of > > SR1 RR SR2 RR SR3 RR SR4 RR .... > > SR --> sequential reader, RR --> random reader Thank you for elaborating. However, I think that fairness in terms of disk time has a similar problem.
Below is a benchmark result of randread vs seqread that I posted before; the rand-readers and seq-readers ran in separate groups and their weights were equally assigned.

Throughput [KiB/s]
            io-controller   dm-ioband
randread              161         314
seqread              9556         631

I know that dm-ioband needs to improve its seqread throughput, but I don't think that io-controller seems quite fair either: even if the disk times of each group are equal, why can't randread get more bandwidth? I think that this is how users think about fairness, and it would be a good thing to provide multiple bandwidth control policies for users. > > The write-starve-reads problem on dm-ioband that you pointed out before was > > not caused by FIFO release; it was caused by IO flow control in > > dm-ioband. When I turned off the flow control, the read > > throughput improved considerably. > > What was flow control doing? dm-ioband sets a limit on each IO group. When the number of IO requests backlogged in a group exceeds the limit, processes which are going to issue IO requests to the group are put to sleep until all the backlogged requests are flushed out. > > Now I'm considering separating dm-ioband's internal queue into sync > > and async queues and giving a certain dispatch priority to async IOs. > > Even if you maintain separate queues for sync and async, in what ratio will > you dispatch reads and writes to the underlying layer once fresh tokens become > available to the group and you decide to unthrottle the group? My current thinking is to dispatch in the requested order, but when the number of in-flight sync IOs exceeds io_limit (io_limit is calculated based on nr_requests of the underlying block device), dm-ioband dispatches only async IOs until the number of in-flight sync IOs falls below io_limit, and vice versa. At least it could solve the write-starve-read issue which you pointed out.
> Whatever policy you adopt for read and write dispatch, it might not match > the policy of the underlying IO scheduler, because every IO scheduler seems to > have its own way of determining how reads and writes should be dispatched. I think that this is a matter of user choice: whether a user would like to give priority to bandwidth or to the IO scheduler's policy. > Now somebody might start complaining that their job inside the group is not > getting the same reader/writer ratio as it was getting outside the group. > > Thanks > Vivek Thanks, Ryo Tsuruta
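The sync/async dispatch rule proposed in this message can be sketched roughly as follows. This is only an illustration, not dm-ioband code: `struct class_state`, `pick_next()` and the per-class sequence number used to model "requested order" are all hypothetical, and treating a class as blocked once its in-flight count reaches io_limit (rather than strictly exceeding it) is an assumption.

```c
#include <assert.h>

enum io_class { IO_NONE = -1, IO_SYNC = 0, IO_ASYNC = 1 };

/* Hypothetical per-class bookkeeping inside one IO group. */
struct class_state {
    unsigned long head_seq;  /* arrival order of the oldest queued request */
    unsigned int queued;     /* requests waiting in this class */
    unsigned int in_flight;  /* requests already sent to the device */
};

/* A class may dispatch if it has work and is still below io_limit. */
static int class_eligible(const struct class_state *c, unsigned int io_limit)
{
    return c->queued > 0 && c->in_flight < io_limit;
}

/*
 * Pick the class to dispatch from next: arrival order when both classes
 * are eligible, otherwise whichever class is still below its limit.
 */
enum io_class pick_next(const struct class_state cls[2], unsigned int io_limit)
{
    assert(io_limit > 0);
    int sync_ok = class_eligible(&cls[IO_SYNC], io_limit);
    int async_ok = class_eligible(&cls[IO_ASYNC], io_limit);

    if (sync_ok && async_ok)
        return cls[IO_SYNC].head_seq <= cls[IO_ASYNC].head_seq
                   ? IO_SYNC : IO_ASYNC;
    if (sync_ok)
        return IO_SYNC;
    if (async_ok)
        return IO_ASYNC;
    return IO_NONE;  /* both classes are empty or at their limit */
}
```

Note that the resulting read/write mix depends on io_limit and on arrival order, not on the policy of the IO scheduler underneath, which is the mismatch raised in the quoted text.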
* Re: IO scheduler based IO controller V10 2009-09-25 9:07 ` Ryo Tsuruta @ 2009-09-25 15:04 ` Rik van Riel -1 siblings, 0 replies; 349+ messages in thread From: Rik van Riel @ 2009-09-25 15:04 UTC To: Ryo Tsuruta Cc: vgoyal, akpm, linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo Ryo Tsuruta wrote: > Because dm-ioband provides fairness in terms of how many IO requests > are issued or how many bytes are transferred, this behaviour is to > be expected. Do you think fairness in terms of IO requests and size is > not fair? When there are two workloads competing for the same resources, I would expect each of the workloads to run at about 50% of the speed at which it would run on an uncontended system. Having one of the workloads run at 95% of the uncontended speed and the other workload at 5% is "not fair" (to put it diplomatically). -- All rights reversed.
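The "fraction of uncontended speed" criterion can be worked through with a small idealized model, sketched here in C. The model is an assumption, not something measured in this thread: each group is described only by the throughput it reaches when running alone, and the cost of seeking between groups is ignored. `rate_time_fair()` and `rate_byte_fair()` are hypothetical helper names.

```c
#include <assert.h>

/* Throughput of a group that is given `time_share` of the disk time. */
double rate_time_fair(double alone_kibs, double time_share)
{
    assert(time_share >= 0.0 && time_share <= 1.0);
    return alone_kibs * time_share;
}

/*
 * Common throughput x when two groups must transfer equal bytes.
 * Each group uses the fraction x / alone of the disk time, so
 * x / alone_a + x / alone_b = 1, which gives the expression below.
 */
double rate_byte_fair(double alone_a, double alone_b)
{
    assert(alone_a > 0.0 && alone_b > 0.0);
    return (alone_a * alone_b) / (alone_a + alone_b);
}
```

With made-up standalone rates of 20000 KiB/s for a sequential reader and 400 KiB/s for a random reader, equal disk time gives 10000 and 200 KiB/s, so both run at 50% of their uncontended speed. Equal bytes gives about 392 KiB/s each, which is roughly 98% of the random reader's uncontended speed but only about 2% of the sequential reader's.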
* Re: IO scheduler based IO controller V10 2009-09-25 15:04 ` Rik van Riel @ 2009-09-28 7:38 ` Ryo Tsuruta -1 siblings, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-09-28 7:38 UTC To: riel Cc: vgoyal, akpm, linux-kernel, jens.axboe, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, peterz, jmarchan, torvalds, mingo Hi Rik, Rik van Riel <riel@redhat.com> wrote: > Ryo Tsuruta wrote: > > > Because dm-ioband provides fairness in terms of how many IO requests > > are issued or how many bytes are transferred, this behaviour is to > > be expected. Do you think fairness in terms of IO requests and size is > > not fair? > > When there are two workloads competing for the same > resources, I would expect each of the workloads to > run at about 50% of the speed at which it would run > on an uncontended system. > > Having one of the workloads run at 95% of the > uncontended speed and the other workload at 5% > is "not fair" (to put it diplomatically). As I wrote in the mail to Vivek, I think that providing multiple policies (per disk time, per IO size, maximum rate limiting, etc.) would be good for users. Thanks, Ryo Tsuruta
* Re: IO scheduler based IO controller V10 2009-09-24 19:25 Vivek Goyal 2009-09-24 21:33 ` Andrew Morton @ 2009-09-25 2:20 ` Ulrich Lukas [not found] ` <4ABC28DE.7050809-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org> 2009-09-25 20:26 ` Vivek Goyal 2009-09-29 0:37 ` Nauman Rafique [not found] ` <1253820332-10246-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 3 siblings, 2 replies; 349+ messages in thread From: Ulrich Lukas @ 2009-09-25 2:20 UTC (permalink / raw) To: Vivek Goyal; +Cc: linux-kernel, containers Vivek Goyal wrote: > Notes: > - With vanilla CFQ, random writers can overwhelm a random reader. > Bring down its throughput and bump up latencies significantly. IIRC, with vanilla CFQ, sequential writing can overwhelm random readers, too. I'm basing this assumption on the observations I made on both OpenSuse 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML titled: "Poor desktop responsiveness with background I/O-operations" of 2009-09-20. (Message ID: 4AB59CBB.8090907@datenparkplatz.de) Thus, I'm posting this to show that your work is greatly appreciated, given the rather disappointig status quo of Linux's fairness when it comes to disk IO time. I hope that your efforts lead to a change in performance of current userland applications, the sooner, the better. Thanks Ulrich ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-09-25  2:20 ` Ulrich Lukas
@ 2009-09-25 20:26 ` Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-25 20:26 UTC (permalink / raw)
To: Ulrich Lukas
Cc: dhaval, dm-devel, jens.axboe, agk, balbir, paolo.valente, jmarchan, fernando, jmoyer, mingo, riel, fchecconi, containers, linux-kernel, akpm, righi.andrea, torvalds

On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> Vivek Goyal wrote:
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader.
> >   Bring down its throughput and bump up latencies significantly.
>
>
> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> too.
>
> I'm basing this assumption on the observations I made on both OpenSuse
> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> titled: "Poor desktop responsiveness with background I/O-operations" of
> 2009-09-20.
> (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
>
>
> Thus, I'm posting this to show that your work is greatly appreciated,
> given the rather disappointing status quo of Linux's fairness when it
> comes to disk IO time.
>
> I hope that your efforts lead to a change in performance of current
> userland applications, the sooner, the better.
>

[Please don't remove people from original CC list. I am putting them back.]

Hi Ulrich,

I quickly went through that mail thread and I tried the following on my
desktop.

##########################################
dd if=/home/vgoyal/4G-file of=/dev/null &
sleep 5
time firefox
# close firefox once gui pops up.
##########################################

It was taking close to 1 minute 30 seconds to launch firefox, and dd got
the following.

4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s

(Results do vary across runs, especially if system is booted fresh. Don't
know why...).

Then I tried putting both the applications in separate groups and assigned
them weights of 200 each.

##########################################
dd if=/home/vgoyal/4G-file of=/dev/null &
echo $! > /cgroup/io/test1/tasks
sleep 5
echo $$ > /cgroup/io/test2/tasks
time firefox
# close firefox once gui pops up.
##########################################

Now firefox pops up in 27 seconds. So it cut down the time by about 2/3.

4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s

Notice that throughput of dd also improved.

I ran the block trace and noticed that in many cases firefox threads
immediately preempted the "dd". Probably because it was a file system
request. So in this case latency will arise from seek time.

In some other cases, threads had to wait for up to 100ms because dd was
not preempted. In this case latency will arise both from waiting on the
queue and from seek time.

With the cgroup setup, we will run a 100ms slice for the group in which
firefox is being launched and then give a 100ms uninterrupted time slice
to dd. So it should cut down on the number of seeks happening, and that's
probably why we see this improvement.

So grouping can help in such cases. Maybe you can move your X session into
one group and launch the big IO in another group. Most likely you should
have a better desktop experience without compromising on dd thread output.

Thanks
Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
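The figures quoted in this message can be cross-checked with a little arithmetic. The following is a minimal sketch in C; the function names are hypothetical helpers, and the only assumption is that dd reports MB as 10^6 bytes.

```c
#include <assert.h>

/* Throughput in MB/s, with 1 MB taken as 10^6 bytes. */
double throughput_mb_s(double bytes, double seconds)
{
	assert(seconds > 0.0);
	return bytes / seconds / 1e6;
}

/* Relative change from 'before' to 'after'; negative means a reduction. */
double relative_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before;
}
```

`throughput_mb_s(4294967296.0, 100.602)` comes out near 42.7 and `throughput_mb_s(4294967296.0, 84.6138)` near 50.8, matching the dd output. `relative_change(90.0, 27.0)` is -0.70, so the firefox launch time dropped by roughly 70% (the "about 2/3" in the text), and `relative_change(42.7, 50.8)` is about +0.19, a dd throughput gain of roughly 19%.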
* Re: IO scheduler based IO controller V10 2009-09-25 20:26 ` Vivek Goyal (?) @ 2009-09-26 14:51 ` Mike Galbraith [not found] ` <1253976676.7005.40.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-09-27 6:55 ` Mike Galbraith -1 siblings, 2 replies; 349+ messages in thread From: Mike Galbraith @ 2009-09-26 14:51 UTC (permalink / raw) To: Vivek Goyal Cc: Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe On Fri, 2009-09-25 at 16:26 -0400, Vivek Goyal wrote: > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote: > > Vivek Goyal wrote: > > > Notes: > > > - With vanilla CFQ, random writers can overwhelm a random reader. > > > Bring down its throughput and bump up latencies significantly. > > > > > > IIRC, with vanilla CFQ, sequential writing can overwhelm random readers, > > too. > > > > I'm basing this assumption on the observations I made on both OpenSuse > > 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML > > titled: "Poor desktop responsiveness with background I/O-operations" of > > 2009-09-20. > > (Message ID: 4AB59CBB.8090907@datenparkplatz.de) > > > > > > Thus, I'm posting this to show that your work is greatly appreciated, > > given the rather disappointig status quo of Linux's fairness when it > > comes to disk IO time. > > > > I hope that your efforts lead to a change in performance of current > > userland applications, the sooner, the better. > > > [Please don't remove people from original CC list. I am putting them back.] > > Hi Ulrich, > > I quicky went through that mail thread and I tried following on my > desktop. > > ########################################## > dd if=/home/vgoyal/4G-file of=/dev/null & > sleep 5 > time firefox > # close firefox once gui pops up. 
> ########################################## > > It was taking close to 1 minute 30 seconds to launch firefox and dd got > following. > > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s > > (Results do vary across runs, especially if system is booted fresh. Don't > know why...). > > > Then I tried putting both the applications in separate groups and assign > them weights 200 each. > > ########################################## > dd if=/home/vgoyal/4G-file of=/dev/null & > echo $! > /cgroup/io/test1/tasks > sleep 5 > echo $$ > /cgroup/io/test2/tasks > time firefox > # close firefox once gui pops up. > ########################################## > > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3. > > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s > > Notice that throughput of dd also improved. > > I ran the block trace and noticed in many a cases firefox threads > immediately preempted the "dd". Probably because it was a file system > request. So in this case latency will arise from seek time. > > In some other cases, threads had to wait for up to 100ms because dd was > not preempted. In this case latency will arise both from waiting on queue > as well as seek time. Hm, with tip, I see ~10ms max wakeup latency running scriptlet below. > With cgroup thing, We will run 100ms slice for the group in which firefox > is being launched and then give 100ms uninterrupted time slice to dd. So > it should cut down on number of seeks happening and that's why we probably > see this improvement. I'm not testing with group IO/CPU, but my numbers kinda agree that it's seek latency that's THE killer. What the compiled numbers below from the cheezy script below that _seem_ to be telling me is that the default setting of CFQ quantum is allowing too many write requests through, inflicting too much read latency... for the disk where my binaries live. 
The longer the seeky burst, the more it hurts both reader and writer, so
cutting down the max requests queueable helps the reader (which I think
can't queue anywhere near as much per unit time as the writer can) finish
and get out of the writer's way sooner.

'nuff possibly useless words, onward to possibly useless numbers :)

dd pre    == number dd emits upon receiving USR1 before execing perf.
perf stat == time to load/execute perf stat konsole -e exit.
dd post   == the same dd number, after perf finishes.

quantum = 1                                         Avg
dd pre       58.4   52.5   56.1   61.6   52.3      56.1  MB/s
perf stat    2.87   0.91   1.64   1.41   0.90       1.5  Sec
dd post      56.6   61.0   66.3   64.7   60.9      61.9

quantum = 2
dd pre       59.7   62.4   58.9   65.3   60.3      61.3
perf stat    5.81   6.09   6.24  10.13   6.21       6.8
dd post      64.0   62.6   64.2   60.4   61.1      62.4

quantum = 3
dd pre       65.5   57.7   54.5   51.1   56.3      57.0
perf stat   14.01  13.71   8.35   5.35   8.57       9.9
dd post      59.2   49.1   58.8   62.3   62.1      58.3

quantum = 4
dd pre       57.2   52.1   56.8   55.2   61.6      56.5
perf stat   11.98   1.61   9.63  16.21  11.13      10.1
dd post      57.2   52.6   62.2   49.3   50.2      54.3

Nothing pinned btw, 4 cores available, but only 1 drive.

#!/bin/sh

DISK=sdb
QUANTUM=/sys/block/$DISK/queue/iosched/quantum
END=$(cat $QUANTUM)

for q in `seq 1 $END`; do
	echo $q > $QUANTUM
	LOGFILE=quantum_log_$q
	rm -f $LOGFILE
	for i in `seq 1 5`; do
		echo 2 > /proc/sys/vm/drop_caches
		sh -c "dd if=/dev/zero of=./deleteme.dd 2>&1|tee -a $LOGFILE" &
		sleep 30
		sh -c "echo quantum $(cat $QUANTUM) loop $i" 2>&1|tee -a $LOGFILE
		perf stat -- killlall -q get_stuf_into_ram >/dev/null 2>&1
		sleep 1
		killall -q -USR1 dd &
		sleep 1
		sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE
		sleep 1
		killall -q -USR1 dd &
		sleep 5
		killall -qw dd
		rm -f ./deleteme.dd
		sync
		sh -c "echo" 2>&1|tee -a $LOGFILE
	done;
done;

^ permalink raw reply [flat|nested] 349+ messages in thread
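To make the trend in the "perf stat" rows easier to see, the averages can be recomputed from the five runs. This is a minimal sketch in C; `mean` and the array names are hypothetical, and the sample values are copied from the quantum = 1 and quantum = 4 rows of the table in this message.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
double mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(v != NULL && n > 0);
	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* "perf stat" rows in seconds, copied from the table. */
static const double perf_q1[5] = { 2.87, 0.91, 1.64, 1.41, 0.90 };
static const double perf_q4[5] = { 11.98, 1.61, 9.63, 16.21, 11.13 };
```

`mean(perf_q1, 5)` is about 1.55 s and `mean(perf_q4, 5)` about 10.1 s, in line with the Avg column. On these numbers the konsole launch is roughly 6.5 times slower at quantum = 4 than at quantum = 1, while the dd throughput averages stay within a few MB/s of each other.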
* Re: IO scheduler based IO controller V10
  2009-09-26 14:51 ` Mike Galbraith
@ 2009-09-27  6:55 ` Mike Galbraith
  0 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-27 6:55 UTC (permalink / raw)
To: Vivek Goyal
Cc: dhaval, dm-devel, jens.axboe, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, mingo, riel, fchecconi, containers, linux-kernel, akpm, righi.andrea, torvalds

My dd vs load non-cached binary woes seem to be coming from backmerge.

#if 0 /*MIKEDIDIT sand in gearbox?*/
	/*
	 * See if our hash lookup can find a potential backmerge.
	 */
	__rq = elv_rqhash_find(q, bio->bi_sector);
	if (__rq && elv_rq_merge_ok(__rq, bio)) {
		*req = __rq;
		return ELEVATOR_BACK_MERGE;
	}
#endif

- = stock = 0
+ = /sys/block/sdb/queue/nomerges = 1
x = backmerge disabled

quantum = 1                                         Avg
dd pre       58.4   52.5   56.1   61.6   52.3      56.1-  MB/s  virgin/foo
             59.6   54.4   53.0   56.1   58.6      56.3+        1.003
             53.8   56.6   54.7   50.7   59.3      55.0x         .980
perf stat    2.87   0.91   1.64   1.41   0.90       1.5-  Sec
             2.61   1.14   1.45   1.43   1.47       1.6+        1.066
             1.07   1.19   1.20   1.24   1.37       1.2x         .800
dd post      56.6   61.0   66.3   64.7   60.9      61.9-
             54.0   59.3   61.1   58.3   58.9      58.3+         .941
             54.3   60.2   59.6   60.6   60.3      59.0x         .953

quantum = 2
dd pre       59.7   62.4   58.9   65.3   60.3      61.3-
             49.4   51.9   58.7   49.3   52.4      52.3+         .853
             58.3   52.8   53.1   50.4   59.9      54.9x         .895
perf stat    5.81   6.09   6.24  10.13   6.21       6.8-
             2.48   2.10   3.23   2.29   2.31       2.4+         .352
             2.09   2.73   1.72   1.96   1.83       2.0x         .294
dd post      64.0   62.6   64.2   60.4   61.1      62.4-
             52.9   56.2   49.6   51.3   51.2      52.2+         .836
             54.7   60.9   56.0   54.0   55.4      56.2x         .900

quantum = 3
dd pre       65.5   57.7   54.5   51.1   56.3      57.0-
             58.1   53.9   52.2   58.2   51.8      54.8+         .961
             60.5   56.5   56.7   55.3   54.6      56.7x         .994
perf stat   14.01  13.71   8.35   5.35   8.57       9.9-
             1.84   2.30   2.14   2.10   2.45       2.1+         .212
             2.12   1.63   2.54   2.23   2.29       2.1x         .212
dd post      59.2   49.1   58.8   62.3   62.1      58.3-
             59.8   53.2   55.2   50.9   53.7      54.5+         .934
             56.1   61.9   51.9   54.3   53.1      55.4x         .950

quantum = 4
dd pre       57.2   52.1   56.8   55.2   61.6      56.5-
             48.7   55.4   51.3   49.7   54.5      51.9+         .918
             55.8   54.5   50.3   56.4   49.3      53.2x         .941
perf stat   11.98   1.61   9.63  16.21  11.13      10.1-
             2.29   1.94   2.68   2.46   2.45       2.3+         .227
             3.01   1.84   2.11   2.27   2.30       2.3x         .227
dd post      57.2   52.6   62.2   49.3   50.2      54.3-
             50.1   54.5   58.4   54.1   49.0      53.2+         .979
             52.9   53.2   50.6   53.2   50.5      52.0x         .957

^ permalink raw reply [flat|nested] 349+ messages in thread
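The right-hand ratio column in this table matches each variant's displayed average divided by the stock ("-") average (for example 2.1 / 9.9 for the quantum = 3 "perf stat" rows). A minimal sketch in C of that comparison follows; the struct and function names are hypothetical, and reading "perf stat" as a latency figure and "dd post" as a throughput figure is an interpretation, not something the message states in those words.

```c
#include <assert.h>

/* Variant-to-stock ratios: below 1.0 means less time / less throughput. */
struct tradeoff {
	double latency;     /* launch time, variant / stock */
	double throughput;  /* dd MB/s, variant / stock     */
};

struct tradeoff compare_to_stock(double stock_sec, double variant_sec,
				 double stock_mbs, double variant_mbs)
{
	struct tradeoff t;

	assert(stock_sec > 0.0 && stock_mbs > 0.0);
	t.latency = variant_sec / stock_sec;
	t.throughput = variant_mbs / stock_mbs;
	return t;
}
```

For quantum = 3 with nomerges = 1, `compare_to_stock(9.9, 2.1, 58.3, 54.5)` gives a latency ratio of about 0.21 and a throughput ratio of about 0.93: launch time falls by nearly 80% for a dd throughput loss of around 7%.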
* Re: IO scheduler based IO controller V10
  2009-09-27  6:55 ` Mike Galbraith
@ 2009-09-27 16:42 ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-09-27 16:42 UTC (permalink / raw)
To: Mike Galbraith
Cc: dhaval, dm-devel, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, mingo, riel, fchecconi, containers, linux-kernel, akpm, righi.andrea, torvalds

On Sun, Sep 27 2009, Mike Galbraith wrote:
> My dd vs load non-cached binary woes seem to be coming from backmerge.
>
> #if 0 /*MIKEDIDIT sand in gearbox?*/
> 	/*
> 	 * See if our hash lookup can find a potential backmerge.
> 	 */
> 	__rq = elv_rqhash_find(q, bio->bi_sector);
> 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
> 		*req = __rq;
> 		return ELEVATOR_BACK_MERGE;
> 	}
> #endif

It's a given that not merging will provide better latency. We can't
disable that or performance will suffer A LOT on some systems. There are
ways to make it better, though. One would be to make the max request
size smaller, but that would also hurt for streamed workloads. Can you
try whether the below patch makes a difference? It will basically
disallow merges to a request that isn't the last one.

We should probably make the merging logic a bit more clever, since the
below won't work well for two (or more) streamed cases. I'll think a bit
about that.

Note this is totally untested!

diff --git a/block/elevator.c b/block/elevator.c
index 1975b61..d00a72b 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
 	 * See if our hash lookup can find a potential backmerge.
 	 */
 	__rq = elv_rqhash_find(q, bio->bi_sector);
-	if (__rq && elv_rq_merge_ok(__rq, bio)) {
-		*req = __rq;
-		return ELEVATOR_BACK_MERGE;
+	if (__rq) {
+		/*
+		 * If requests are queued behind this one, disallow merge. This
+		 * prevents streaming IO from continually passing new IO.
+		 */
+		if (elv_latter_request(q, __rq))
+			return ELEVATOR_NO_MERGE;
+		if (elv_rq_merge_ok(__rq, bio)) {
+			*req = __rq;
+			return ELEVATOR_BACK_MERGE;
+		}
 	}
 
 	if (e->ops->elevator_merge_fn)

-- 
Jens Axboe

^ permalink raw reply related [flat|nested] 349+ messages in thread
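The rule in the patch ("disallow merges to a request that isn't the last one") can be illustrated outside the kernel. The following is a toy userspace model in C, not kernel code: the queue is a plain array in dispatch order, "queued behind" is taken to mean "has a later array index" (an assumption standing in for `elv_latter_request()`), and every type and function name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum merge_result { NO_MERGE, BACK_MERGE };

struct toy_request {
	unsigned long start;  /* first sector          */
	unsigned long nr;     /* number of sectors     */
};

/* Index of the request that ends exactly where the bio starts, or -1. */
static long find_back_merge(const struct toy_request *q, size_t n,
			    unsigned long bio_sector)
{
	size_t i;

	assert(q != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (q[i].start + q[i].nr == bio_sector)
			return (long)i;
	return -1;
}

/* Stock behaviour in this model: back merge whenever a candidate exists. */
enum merge_result stock_merge(struct toy_request *q, size_t n,
			      unsigned long bio_sector, unsigned long bio_nr)
{
	long i = find_back_merge(q, n, bio_sector);

	if (i < 0)
		return NO_MERGE;
	q[i].nr += bio_nr;
	return BACK_MERGE;
}

/* Patched behaviour: refuse if anything is queued behind the candidate. */
enum merge_result patched_merge(struct toy_request *q, size_t n,
				unsigned long bio_sector, unsigned long bio_nr)
{
	long i = find_back_merge(q, n, bio_sector);

	if (i < 0)
		return NO_MERGE;
	if ((size_t)i != n - 1)
		return NO_MERGE;  /* a later request would be passed over */
	q[i].nr += bio_nr;
	return BACK_MERGE;
}
```

With a queue holding a streaming request at sectors 0..7 followed by an unrelated request at sector 100, a new bio at sector 8 is merged into the first request by `stock_merge()`, so the streamer keeps growing ahead of the later request. `patched_merge()` returns `NO_MERGE` for the same bio, and still merges a bio at sector 108 into the last request. The model also shows the limitation mentioned in the message: with two interleaved streams, only the one whose request happens to be last can merge.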
* Re: IO scheduler based IO controller V10 [not found] ` <20090927164235.GA23126-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> @ 2009-09-27 18:15 ` Mike Galbraith 2009-09-30 19:58 ` Mike Galbraith 1 sibling, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-09-27 18:15 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote: > On Sun, Sep 27 2009, Mike Galbraith wrote: > > My dd vs load non-cached binary woes seem to be coming from backmerge. > > > > #if 0 /*MIKEDIDIT sand in gearbox?*/ > > /* > > * See if our hash lookup can find a potential backmerge. > > */ > > __rq = elv_rqhash_find(q, bio->bi_sector); > > if (__rq && elv_rq_merge_ok(__rq, bio)) { > > *req = __rq; > > return ELEVATOR_BACK_MERGE; > > } > > #endif > > It's a given that not merging will provide better latency. Yeah, absolutely everything I've diddled that reduces the size of queued data improves the situation, which makes perfect sense. This one was a bit unexpected. Front merges didn't hurt at all, back merges did, and lots. After diddling the code a bit, I had the "well _duh_" moment. > We can't > disable that or performance will suffer A LOT on some systems. There are > ways to make it better, though. One would be to make the max request > size smaller, but that would also hurt for streamed workloads. Can you > try whether the below patch makes a difference? 
It will basically > disallow merges to a request that isn't the last one. That's what all the looking I've done ends up at. Either you let the disk be all it can be, and you pay in latency, or you don't, and you pay in throughput. > below wont work well for two (or more) streamed cases. I'll think a bit > about that. Cool, think away. I've been eyeballing and pondering how to know when latency is going to become paramount. Absolutely nothing is happening, even for "it's my root". > Note this is totally untested! I'll give it a shot first thing in the A.M. Note: I tested my stable of kernels today (22->), and we are better off dd vs read today than ever in this time span at least. (i can't recall ever seeing a system where beating snot outta root didn't hurt really bad... would be very nice though;) > diff --git a/block/elevator.c b/block/elevator.c > index 1975b61..d00a72b 100644 > --- a/block/elevator.c > +++ b/block/elevator.c > @@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio) > * See if our hash lookup can find a potential backmerge. > */ > __rq = elv_rqhash_find(q, bio->bi_sector); > - if (__rq && elv_rq_merge_ok(__rq, bio)) { > - *req = __rq; > - return ELEVATOR_BACK_MERGE; > + if (__rq) { > + /* > + * If requests are queued behind this one, disallow merge. This > + * prevents streaming IO from continually passing new IO. > + */ > + if (elv_latter_request(q, __rq)) > + return ELEVATOR_NO_MERGE; > + if (elv_rq_merge_ok(__rq, bio)) { > + *req = __rq; > + return ELEVATOR_BACK_MERGE; > + } > } > > if (e->ops->elevator_merge_fn) > ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <20090927164235.GA23126-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-09-27 18:15 ` Mike Galbraith @ 2009-09-30 19:58 ` Mike Galbraith 1 sibling, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-09-30 19:58 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote: > It's a given that not merging will provide better latency. We can't > disable that or performance will suffer A LOT on some systems. There are > ways to make it better, though. One would be to make the max request > size smaller, but that would also hurt for streamed workloads. Can you > try whether the below patch makes a difference? It will basically > disallow merges to a request that isn't the last one. Thoughts about something like the below? The problem with the dd vs konsole -e exit type load seems to be kjournald overloading the disk between reads. When userland is blocked, kjournald is free to stuff 4*quantum into the queue instantly. Taking the hint from Vivek's fairness tweakable patch, I stamped the queue when a seeker was last seen, and disallowed overload within CIC_SEEK_THR of that time. Worked well. 
dd competing against perf stat -- konsole -e exec timings, 5 back to back runs Avg before 9.15 14.51 9.39 15.06 9.90 11.6 after 1.76 1.54 1.93 1.88 1.56 1.7 diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index e2a9b92..4a00129 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -174,6 +174,8 @@ struct cfq_data { unsigned int cfq_slice_async_rq; unsigned int cfq_slice_idle; + unsigned long last_seeker; + struct list_head cic_list; /* @@ -1326,6 +1328,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force) return 0; /* + * We may have seeky queues, don't throttle up just yet. + */ + if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR)) + return 0; + + /* * we are the only queue, allow up to 4 times of 'quantum' */ if (cfqq->dispatched >= 4 * max_dispatch) @@ -1941,7 +1949,7 @@ static void cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, struct cfq_io_context *cic) { - int old_idle, enable_idle; + int old_idle, enable_idle, seeky = 0; /* * Don't idle for async or idle io prio class @@ -1951,8 +1959,12 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, enable_idle = old_idle = cfq_cfqq_idle_window(cfqq); - if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || - (cfqd->hw_tag && CIC_SEEKY(cic))) + if (cfqd->hw_tag && CIC_SEEKY(cic)) { + cfqd->last_seeker = jiffies; + seeky = 1; + } + + if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || seeky) enable_idle = 0; else if (sample_valid(cic->ttime_samples)) { if (cic->ttime_mean > cfqd->cfq_slice_idle) @@ -2482,6 +2494,7 @@ static void *cfq_init_queue(struct request_queue *q) cfqd->cfq_slice_async_rq = cfq_slice_async_rq; cfqd->cfq_slice_idle = cfq_slice_idle; cfqd->hw_tag = 1; + cfqd->last_seeker = jiffies; return cfqd; } ^ permalink raw reply related [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-09-27 16:42 ` Jens Axboe
       [not found] ` <20090927164235.GA23126-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
@ 2009-09-27 18:15 ` Mike Galbraith
  2009-09-28  4:04 ` Mike Galbraith
       [not found] ` <1254075359.7354.66.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-09-30 19:58 ` Mike Galbraith
  2 siblings, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-27 18:15 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:
> On Sun, Sep 27 2009, Mike Galbraith wrote:
> > My dd vs load non-cached binary woes seem to be coming from backmerge.
> >
> > #if 0 /*MIKEDIDIT sand in gearbox?*/
> > 	/*
> > 	 * See if our hash lookup can find a potential backmerge.
> > 	 */
> > 	__rq = elv_rqhash_find(q, bio->bi_sector);
> > 	if (__rq && elv_rq_merge_ok(__rq, bio)) {
> > 		*req = __rq;
> > 		return ELEVATOR_BACK_MERGE;
> > 	}
> > #endif
>
> It's a given that not merging will provide better latency.

Yeah, absolutely everything I've diddled that reduces the size of queued
data improves the situation, which makes perfect sense. This one was a
bit unexpected. Front merges didn't hurt at all, back merges did, and
lots. After diddling the code a bit, I had the "well _duh_" moment.

> We can't
> disable that or performance will suffer A LOT on some systems. There are
> ways to make it better, though. One would be to make the max request
> size smaller, but that would also hurt for streamed workloads. Can you
> try whether the below patch makes a difference? It will basically
> disallow merges to a request that isn't the last one.

That's what all the looking I've done ends up at. Either you let the
disk be all it can be, and you pay in latency, or you don't, and you pay
in throughput.

> below won't work well for two (or more) streamed cases. I'll think a bit
> about that.

Cool, think away. I've been eyeballing and pondering how to know when
latency is going to become paramount. Absolutely nothing is happening,
even for "it's my root".

> Note this is totally untested!

I'll give it a shot first thing in the A.M.

Note: I tested my stable of kernels today (22->), and we are better off
dd vs read today than ever in this time span at least.

(I can't recall ever seeing a system where beating snot outta root
didn't hurt really bad... would be very nice though;)

> diff --git a/block/elevator.c b/block/elevator.c
> index 1975b61..d00a72b 100644
> --- a/block/elevator.c
> +++ b/block/elevator.c
> @@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
>  	 * See if our hash lookup can find a potential backmerge.
>  	 */
>  	__rq = elv_rqhash_find(q, bio->bi_sector);
> -	if (__rq && elv_rq_merge_ok(__rq, bio)) {
> -		*req = __rq;
> -		return ELEVATOR_BACK_MERGE;
> +	if (__rq) {
> +		/*
> +		 * If requests are queued behind this one, disallow merge. This
> +		 * prevents streaming IO from continually passing new IO.
> +		 */
> +		if (elv_latter_request(q, __rq))
> +			return ELEVATOR_NO_MERGE;
> +		if (elv_rq_merge_ok(__rq, bio)) {
> +			*req = __rq;
> +			return ELEVATOR_BACK_MERGE;
> +		}
>  	}
>
>  	if (e->ops->elevator_merge_fn)
>

^ permalink raw reply	[flat|nested] 349+ messages in thread
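The rule in the quoted patch (back merge only into a request that has
nothing queued behind it) can be illustrated with a small userspace toy.
This is only a sketch under stated assumptions, not the kernel code: the
names toy_request and toy_back_merge are hypothetical, the queue is
modelled as a plain array in dispatch order, and "has a latter request"
is approximated as "is not the last array element".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the elevator return codes. */
enum toy_merge { TOY_NO_MERGE, TOY_BACK_MERGE };

/* A queued request covering sectors [start, start + nr_sectors). */
struct toy_request {
	unsigned long start;
	unsigned long nr_sectors;
};

/*
 * Look for a request that ends exactly where the new bio begins.
 * With tail_only set, refuse the merge unless that request is the
 * last one in the queue, mirroring the intent of the quoted patch.
 * On success the index of the merge target is stored in *idx.
 */
enum toy_merge toy_back_merge(const struct toy_request *queue, size_t n,
			      unsigned long bio_sector, int tail_only,
			      size_t *idx)
{
	assert(idx != NULL);

	for (size_t i = 0; i < n; i++) {
		if (queue[i].start + queue[i].nr_sectors != bio_sector)
			continue;
		/* Something is queued behind the candidate: do not merge. */
		if (tail_only && i + 1 < n)
			return TOY_NO_MERGE;
		*idx = i;
		return TOY_BACK_MERGE;
	}
	return TOY_NO_MERGE;
}
```

With two queued requests, a bio contiguous with the first one merges in
the unrestricted case but is rejected once tail_only is set, which is
the behaviour change under test in this subthread. A bio contiguous with
the last request merges either way.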
* Re: IO scheduler based IO controller V10
  2009-09-27 18:15 ` Mike Galbraith
@ 2009-09-28  4:04 ` Mike Galbraith
  2009-09-28  5:55 ` Mike Galbraith
  ` (2 more replies)
       [not found] ` <1254075359.7354.66.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  1 sibling, 3 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-28 4:04 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Sun, 2009-09-27 at 20:16 +0200, Mike Galbraith wrote:
> On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:

> I'll give it a shot first thing in the A.M.

> > diff --git a/block/elevator.c b/block/elevator.c
> > index 1975b61..d00a72b 100644
> > --- a/block/elevator.c
> > +++ b/block/elevator.c
> > @@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
> >  	 * See if our hash lookup can find a potential backmerge.
> >  	 */
> >  	__rq = elv_rqhash_find(q, bio->bi_sector);
> > -	if (__rq && elv_rq_merge_ok(__rq, bio)) {
> > -		*req = __rq;
> > -		return ELEVATOR_BACK_MERGE;
> > +	if (__rq) {
> > +		/*
> > +		 * If requests are queued behind this one, disallow merge. This
> > +		 * prevents streaming IO from continually passing new IO.
> > +		 */
> > +		if (elv_latter_request(q, __rq))
> > +			return ELEVATOR_NO_MERGE;
> > +		if (elv_rq_merge_ok(__rq, bio)) {
> > +			*req = __rq;
> > +			return ELEVATOR_BACK_MERGE;
> > +		}
> >  	}
> >
> >  	if (e->ops->elevator_merge_fn)

- = virgin tip v2.6.31-10215-ga3c9602
+ = with patchlet
                                                            Avg
dd pre         67.4     70.9     65.4     68.9     66.2     67.7-
               65.9     68.5     69.8     65.2     65.8     67.0-     Avg
               70.4     70.3     65.1     66.4     70.1     68.4-     67.7-
               73.1     64.6     65.3     65.3     64.9     66.6+     65.6+     .968
               63.8     67.9     65.2     65.1     64.4     65.2+
               64.9     66.3     64.1     65.2     64.8     65.0+
perf stat      8.66    16.29     9.65    14.88     9.45     11.7-
              15.36     9.71    15.47    10.44    12.93     12.7-
              10.55    15.11    10.22    15.35    10.32     12.3-     12.2-
               9.87     7.53    10.62     7.51     9.95      9.0+      9.1+     .745
               7.73    10.12     8.19    11.87     8.07      9.1+
              11.04     7.62    10.14     8.13    10.23      9.4+
dd post        63.4     60.5     66.7     64.5     67.3     64.4-
               64.4     66.8     64.3     61.5     62.0     63.8-
               63.8     64.9     66.2     65.6     66.9     65.4-     64.5-
               60.9     63.4     60.2     63.4     65.5     62.6+     61.8+     .958
               63.3     59.9     61.9     62.7     61.2     61.8+
               60.1     63.7     59.5     61.5     60.6     61.0+

^ permalink raw reply	[flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-09-28  4:04 ` Mike Galbraith
@ 2009-09-28  5:55 ` Mike Galbraith
  2009-09-28 17:48 ` Vivek Goyal
       [not found] ` <1254110648.7683.3.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-28 5:55 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

P.S.

On Mon, 2009-09-28 at 06:04 +0200, Mike Galbraith wrote:

> - = virgin tip v2.6.31-10215-ga3c9602
> + = with patchlet
>                                                             Avg
> dd pre         67.4     70.9     65.4     68.9     66.2     67.7-
>                65.9     68.5     69.8     65.2     65.8     67.0-     Avg
>                70.4     70.3     65.1     66.4     70.1     68.4-     67.7-
>                73.1     64.6     65.3     65.3     64.9     66.6+     65.6+     .968
>                63.8     67.9     65.2     65.1     64.4     65.2+
>                64.9     66.3     64.1     65.2     64.8     65.0+
> perf stat      8.66    16.29     9.65    14.88     9.45     11.7-
>               15.36     9.71    15.47    10.44    12.93     12.7-
>               10.55    15.11    10.22    15.35    10.32     12.3-     12.2-
>                9.87     7.53    10.62     7.51     9.95      9.0+      9.1+     .745
>                7.73    10.12     8.19    11.87     8.07      9.1+
>               11.04     7.62    10.14     8.13    10.23      9.4+
> dd post        63.4     60.5     66.7     64.5     67.3     64.4-
>                64.4     66.8     64.3     61.5     62.0     63.8-
>                63.8     64.9     66.2     65.6     66.9     65.4-     64.5-
>                60.9     63.4     60.2     63.4     65.5     62.6+     61.8+     .958
>                63.3     59.9     61.9     62.7     61.2     61.8+
>                60.1     63.7     59.5     61.5     60.6     61.0+

Deadline and noop fsc^W are less than wonderful choices for this load.

perf stat     12.82     7.19     8.49     5.76     9.32  anticipatory
              16.24   175.82   154.38   228.97   147.16  noop
              43.23    57.39    96.13   148.25   180.09  deadline
              28.65   167.40   195.95   183.69   178.61  deadline v2.6.27.35

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-09-28  4:04 ` Mike Galbraith
@ 2009-09-28 17:48 ` Vivek Goyal
  2009-09-28 17:48 ` Vivek Goyal
       [not found] ` <1254110648.7683.3.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-28 17:48 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Jens Axboe, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Mon, Sep 28, 2009 at 06:04:08AM +0200, Mike Galbraith wrote:
> On Sun, 2009-09-27 at 20:16 +0200, Mike Galbraith wrote:
> > On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:
>
> > I'll give it a shot first thing in the A.M.
>
> > > diff --git a/block/elevator.c b/block/elevator.c
> > > index 1975b61..d00a72b 100644
> > > --- a/block/elevator.c
> > > +++ b/block/elevator.c
> > > @@ -497,9 +497,17 @@ int elv_merge(struct request_queue *q, struct request **req, struct bio *bio)
> > >  	 * See if our hash lookup can find a potential backmerge.
> > >  	 */
> > >  	__rq = elv_rqhash_find(q, bio->bi_sector);
> > > -	if (__rq && elv_rq_merge_ok(__rq, bio)) {
> > > -		*req = __rq;
> > > -		return ELEVATOR_BACK_MERGE;
> > > +	if (__rq) {
> > > +		/*
> > > +		 * If requests are queued behind this one, disallow merge. This
> > > +		 * prevents streaming IO from continually passing new IO.
> > > +		 */
> > > +		if (elv_latter_request(q, __rq))
> > > +			return ELEVATOR_NO_MERGE;
> > > +		if (elv_rq_merge_ok(__rq, bio)) {
> > > +			*req = __rq;
> > > +			return ELEVATOR_BACK_MERGE;
> > > +		}
> > >  	}
> > >
> > >  	if (e->ops->elevator_merge_fn)
>
> - = virgin tip v2.6.31-10215-ga3c9602
> + = with patchlet
>                                                             Avg
> dd pre         67.4     70.9     65.4     68.9     66.2     67.7-
>                65.9     68.5     69.8     65.2     65.8     67.0-     Avg
>                70.4     70.3     65.1     66.4     70.1     68.4-     67.7-
>                73.1     64.6     65.3     65.3     64.9     66.6+     65.6+     .968
>                63.8     67.9     65.2     65.1     64.4     65.2+
>                64.9     66.3     64.1     65.2     64.8     65.0+
> perf stat      8.66    16.29     9.65    14.88     9.45     11.7-
>               15.36     9.71    15.47    10.44    12.93     12.7-
>               10.55    15.11    10.22    15.35    10.32     12.3-     12.2-
>                9.87     7.53    10.62     7.51     9.95      9.0+      9.1+     .745
>                7.73    10.12     8.19    11.87     8.07      9.1+
>               11.04     7.62    10.14     8.13    10.23      9.4+
> dd post        63.4     60.5     66.7     64.5     67.3     64.4-
>                64.4     66.8     64.3     61.5     62.0     63.8-
>                63.8     64.9     66.2     65.6     66.9     65.4-     64.5-
>                60.9     63.4     60.2     63.4     65.5     62.6+     61.8+     .958
>                63.3     59.9     61.9     62.7     61.2     61.8+
>                60.1     63.7     59.5     61.5     60.6     61.0+
>

Hmm, so close to a 25% reduction on average in the completion time of
konsole. But this is in the presence of a writer. Does this help even in
the presence of 1 or more sequential readers going?

So here latency seems to be coming from three sources.

- Wait in CFQ before the request is dispatched (only in case of
  competing seq readers).
- Seek latencies.
- Latencies because bigger requests are already dispatched to disk.

So limiting the size of requests will help with the third factor but not
with the first two, and here seek latencies seem to be the biggest
contributor.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread
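The "close to 25%" figure follows from the perf stat averages in the
quoted table: 12.2 without the patchlet and 9.1 with it, a ratio of
roughly .745. A minimal sketch of that arithmetic, assuming the table
columns are per-run timings and the last columns are averages (the
helper names toy_mean and toy_ratio are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples; n must be non-zero. */
double toy_mean(const double *v, size_t n)
{
	double sum = 0.0;

	assert(n > 0);
	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Ratio of the patched average to the unpatched average. */
double toy_ratio(double before, double after)
{
	assert(before > 0.0);
	return after / before;
}
```

Feeding the first unpatched perf stat row (8.66 16.29 9.65 14.88 9.45)
to toy_mean gives about 11.79, which the table lists as 11.7, and
toy_ratio(12.2, 9.1) gives about 0.746, i.e. the patched runs take
roughly three quarters of the unpatched time.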
* Re: IO scheduler based IO controller V10
  2009-09-28 17:48 ` Vivek Goyal
  (?)
@ 2009-09-28 18:24 ` Mike Galbraith
  -1 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-28 18:24 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Jens Axboe, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Mon, 2009-09-28 at 13:48 -0400, Vivek Goyal wrote:

> Hmm.., so close to 25% reduction on average in completion time of konsole.
> But this is in presence of writer. Does this help even in presence of 1 or
> more sequential readers going?

Dunno, I've only tested sequential writer.

> So here latency seems to be coming from three sources.
>
> - Wait in CFQ before request is dispatched (only in case of competing seq readers).
> - seek latencies
> - latencies because of bigger requests are already dispatched to disk.
>
> So limiting the size of request will help with third factor but not with first
> two factors and here seek latencies seem to be the biggest contributor.

Yeah, seek latency seems to dominate.

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-09-27 16:42 ` Jens Axboe
       [not found] ` <20090927164235.GA23126-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2009-09-27 18:15 ` Mike Galbraith
@ 2009-09-30 19:58 ` Mike Galbraith
       [not found] ` <1254340730.7695.32.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-09-30 20:05 ` Mike Galbraith
  2 siblings, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-30 19:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel

On Sun, 2009-09-27 at 18:42 +0200, Jens Axboe wrote:

> It's a given that not merging will provide better latency. We can't
> disable that or performance will suffer A LOT on some systems. There are
> ways to make it better, though. One would be to make the max request
> size smaller, but that would also hurt for streamed workloads. Can you
> try whether the below patch makes a difference? It will basically
> disallow merges to a request that isn't the last one.

Thoughts about something like the below?

The problem with the dd vs konsole -e exit type load seems to be
kjournald overloading the disk between reads. When userland is blocked,
kjournald is free to stuff 4*quantum into the queue instantly.

Taking the hint from Vivek's fairness tweakable patch, I stamped the
queue when a seeker was last seen, and disallowed overload within
CIC_SEEK_THR of that time. Worked well.

dd competing against perf stat -- konsole -e exec timings, 5 back to back runs
                                                            Avg
before         9.15    14.51     9.39    15.06     9.90     11.6
after          1.76     1.54     1.93     1.88     1.56      1.7

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e2a9b92..4a00129 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -174,6 +174,8 @@ struct cfq_data {
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 
+	unsigned long last_seeker;
+
 	struct list_head cic_list;
 
 	/*
@@ -1326,6 +1328,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 			return 0;
 
 		/*
+		 * We may have seeky queues, don't throttle up just yet.
+		 */
+		if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
+			return 0;
+
+		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
 		if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1941,7 +1949,7 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct cfq_io_context *cic)
 {
-	int old_idle, enable_idle;
+	int old_idle, enable_idle, seeky = 0;
 
 	/*
 	 * Don't idle for async or idle io prio class
@@ -1951,8 +1959,12 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
+	if (cfqd->hw_tag && CIC_SEEKY(cic)) {
+		cfqd->last_seeker = jiffies;
+		seeky = 1;
+	}
+
+	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || seeky)
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2482,6 +2494,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->hw_tag = 1;
+	cfqd->last_seeker = jiffies;
 
 	return cfqd;
 }

^ permalink raw reply related	[flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-09-30 19:58 ` Mike Galbraith [not found] ` <1254340730.7695.32.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> @ 2009-09-30 20:05 ` Mike Galbraith 2009-09-30 20:24 ` Vivek Goyal [not found] ` <1254341139.7695.36.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 1 sibling, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-30 20:05 UTC (permalink / raw)
To: Jens Axboe
Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel

>  	/*
> +	 * We may have seeky queues, don't throttle up just yet.
> +	 */
> +	if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
> +		return 0;
> +

bzzzt.  Window too large, but the thought is to let them overload, but
not instantly.

	-Mike

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-09-30 20:05 ` Mike Galbraith @ 2009-09-30 20:24 ` Vivek Goyal [not found] ` <1254341139.7695.36.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-30 20:24 UTC (permalink / raw)
To: Mike Galbraith
Cc: Jens Axboe, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel

On Wed, Sep 30, 2009 at 10:05:39PM +0200, Mike Galbraith wrote:
>
> >  	/*
> > +	 * We may have seeky queues, don't throttle up just yet.
> > +	 */
> > +	if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
> > +		return 0;
> > +
>
> bzzzt.  Window too large, but the thought is to let them overload, but
> not instantly.
>

CIC_SEEK_THR is 8K jiffies, so that would be 8 seconds on a 1000 HZ
system. Try using one "slice_idle" period of 8 ms. But it might turn out
to be too short depending on the disk speed.

Thanks
Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
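The tick arithmetic in the reply above can be checked with a small sketch. It takes "8K" to mean 8 * 1024, which is an assumption about the constant's value, and the helper names are hypothetical rather than kernel API.

```c
#include <assert.h>

/* Convert a tick count to milliseconds for a tick rate of hz per second. */
static unsigned long sketch_ticks_to_ms(unsigned long ticks, unsigned long hz)
{
	assert(hz > 0);
	return ticks * 1000UL / hz;
}

/* Convert milliseconds to whole ticks, rounding down. */
static unsigned long sketch_ms_to_ticks(unsigned long ms, unsigned long hz)
{
	assert(hz > 0);
	return ms * hz / 1000UL;
}
```

Under that assumption, sketch_ticks_to_ms(8 * 1024, 1000) gives 8192 ms, the roughly 8 second window mentioned, while an 8 ms slice_idle period is sketch_ms_to_ticks(8, 1000) = 8 ticks. At a lower tick rate the same constant stretches further: sketch_ticks_to_ms(8 * 1024, 250) is 32768 ms.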
* Re: IO scheduler based IO controller V10 2009-09-30 20:24 ` Vivek Goyal (?) (?) @ 2009-10-01 7:33 ` Mike Galbraith [not found] ` <1254382405.7595.9.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-02 18:08 ` Jens Axboe -1 siblings, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-01 7:33 UTC (permalink / raw)
To: Vivek Goyal
Cc: Jens Axboe, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel

On Wed, 2009-09-30 at 16:24 -0400, Vivek Goyal wrote:
> On Wed, Sep 30, 2009 at 10:05:39PM +0200, Mike Galbraith wrote:
> >
> > >  	/*
> > > +	 * We may have seeky queues, don't throttle up just yet.
> > > +	 */
> > > +	if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR))
> > > +		return 0;
> > > +
> >
> > bzzzt.  Window too large, but the thought is to let them overload, but
> > not instantly.
> >
>
> CIC_SEEK_THR is 8K jiffies, so that would be 8 seconds on a 1000 HZ
> system. Try using one "slice_idle" period of 8 ms. But it might turn out
> to be too short depending on the disk speed.

Yeah, it is too short, as is even _400_ ms.  Trouble is, by the time
some new task is determined to be seeky, the damage is already done.

The below does better, though not as well as "just say no to overload"
of course ;-)

I have a patchlet from Corrado to test, likely better time investment
than poking this darn thing with sharp sticks.

	-Mike

grep elapsed testo.log
    0.894345911  seconds time elapsed  <== solo seeky test measurement
    3.732472877  seconds time elapsed
    3.208443735  seconds time elapsed
    4.249776673  seconds time elapsed
    2.763449260  seconds time elapsed
    4.235271019  seconds time elapsed

(3.73 + 3.20 + 4.24 + 2.76 + 4.23) / 5 / 0.89 = 4... darn.

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index e2a9b92..44a888d 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -174,6 +174,8 @@ struct cfq_data {
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 
+	unsigned long od_stamp;
+
 	struct list_head cic_list;
 
 	/*
@@ -1296,19 +1298,26 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 	/*
 	 * Drain async requests before we start sync IO
 	 */
-	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+		cfqd->od_stamp = jiffies;
 		return 0;
+	}
 
 	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
-	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+		cfqd->od_stamp = jiffies;
 		return 0;
+	}
 
 	max_dispatch = cfqd->cfq_quantum;
 	if (cfq_class_idle(cfqq))
 		max_dispatch = 1;
 
+	if (cfqd->busy_queues > 1)
+		cfqd->od_stamp = jiffies;
+
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
@@ -1326,6 +1335,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		return 0;
 
 	/*
+	 * Don't start overloading until we've been alone for a bit.
+	 */
+	if (time_before(jiffies, cfqd->od_stamp + cfq_slice_sync))
+		return 0;
+
+	/*
 	 * we are the only queue, allow up to 4 times of 'quantum'
 	 */
 	if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1941,7 +1956,7 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct cfq_io_context *cic)
 {
-	int old_idle, enable_idle;
+	int old_idle, enable_idle, seeky = 0;
 
 	/*
 	 * Don't idle for async or idle io prio class
@@ -1949,10 +1964,19 @@ cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
+	if (cfqd->hw_tag) {
+		if (CIC_SEEKY(cic))
+			seeky = 1;
+		/*
+		 * If known or incalculable seekiness, delay.
+		 */
+		if (seeky || !sample_valid(cic->seek_samples))
+			cfqd->od_stamp = jiffies;
+	}
+
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
-	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (cfqd->hw_tag && CIC_SEEKY(cic)))
+	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || seeky)
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2482,6 +2506,7 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->hw_tag = 1;
+	cfqd->od_stamp = INITIAL_JIFFIES;
 
 	return cfqd;
 }

^ permalink raw reply related [flat|nested] 349+ messages in thread
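The "= 4... darn" figure in the message above is the mean of the five contended runs divided by the solo run time. A small sketch of that calculation follows; sketch_slowdown() is a hypothetical helper name used only for illustration.

```c
#include <assert.h>

/*
 * Mean elapsed time of n contended runs, expressed as a multiple of the
 * solo (uncontended) elapsed time.
 */
static double sketch_slowdown(const double *runs, int n, double solo)
{
	double sum = 0.0;
	int i;

	assert(n > 0 && solo > 0.0);
	for (i = 0; i < n; i++)
		sum += runs[i];
	return sum / n / solo;
}
```

Fed the rounded timings quoted in the message (3.73, 3.20, 4.24, 2.76, 4.23 against a solo time of 0.89), it returns a little over 4, so the konsole start is still about four times slower under dd load than when run alone.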
* Re: IO scheduler based IO controller V10 @ 2009-10-01 18:58 ` Jens Axboe 0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-01 18:58 UTC (permalink / raw)
To: Mike Galbraith
Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel

On Thu, Oct 01 2009, Mike Galbraith wrote:
> > CIC_SEEK_THR is 8K jiffies, so that would be 8 seconds on a 1000 HZ
> > system. Try using one "slice_idle" period of 8 ms. But it might turn out
> > to be too short depending on the disk speed.
>
> Yeah, it is too short, as is even _400_ ms.  Trouble is, by the time
> some new task is determined to be seeky, the damage is already done.
>
> The below does better, though not as well as "just say no to overload"
> of course ;-)

So this essentially takes the "avoid impact from previous slice" to a
new extreme, by idling even before dispatching requests from the new
queue. We basically do two things to prevent this already - one is to
only set the slice when the first request is actually serviced, and the
other is to drain async requests completely before starting sync ones.
I'm a bit surprised that the former doesn't solve the problem fully. I
guess what happens is that if the drive has been flooded with writes, it
may service the new read immediately and then return to finish emptying
its writeback cache. This will cause an impact for any sync IO until
that cache is flushed, and then cause that sync queue to not get as much
service as it should have.

Perhaps the "set slice on first complete" isn't working correctly? Or
perhaps we just need to be more extreme.

--
Jens Axboe

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-01 18:58 ` Jens Axboe (?) @ 2009-10-02 6:23 ` Mike Galbraith [not found] ` <1254464628.7158.101.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-02 8:04 ` Jens Axboe -1 siblings, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 6:23 UTC (permalink / raw)
To: Jens Axboe
Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel

On Thu, 2009-10-01 at 20:58 +0200, Jens Axboe wrote:
> On Thu, Oct 01 2009, Mike Galbraith wrote:
> > > CIC_SEEK_THR is 8K jiffies, so that would be 8 seconds on a 1000 HZ
> > > system. Try using one "slice_idle" period of 8 ms. But it might turn out
> > > to be too short depending on the disk speed.
> >
> > Yeah, it is too short, as is even _400_ ms.  Trouble is, by the time
> > some new task is determined to be seeky, the damage is already done.
> >
> > The below does better, though not as well as "just say no to overload"
> > of course ;-)
>
> So this essentially takes the "avoid impact from previous slice" to a
> new extreme, by idling even before dispatching requests from the new
> queue. We basically do two things to prevent this already - one is to
> only set the slice when the first request is actually serviced, and the
> other is to drain async requests completely before starting sync ones.
> I'm a bit surprised that the former doesn't solve the problem fully. I
> guess what happens is that if the drive has been flooded with writes, it
> may service the new read immediately and then return to finish emptying
> its writeback cache. This will cause an impact for any sync IO until
> that cache is flushed, and then cause that sync queue to not get as much
> service as it should have.

I did the stamping selection other than how long have we been solo based
on these possibly wrong speculations:

If we're in the idle window and doing the async drain thing, we're at
the spot where Vivek's patch helps a ton.  Seemed like a great time to
limit the size of any io that may land in front of my sync reader to
plain "you are not alone" quantity.

If we've got sync io in flight, that should mean that my new or old
known seeky queue has been serviced at least once.  There's likely to be
more on the way, so delay overloading then too.

The seeky bit is supposed to be the earlier "last time we saw a seeker"
thing, but known seeky is too late to help a new task at all unless you
turn off the overloading for ages, so I added the if incalculable check
for good measure, hoping that meant the task is new, may want to exec.

Stamping any place may (see below) possibly limit the size of the io the
reader can generate as well as writer, but I figured what's good for the
goose is good for the gander, or it ain't really good.  The overload
was causing the observed pain, definitely ain't good for both at these
times at least, so don't let it do that.

> Perhaps the "set slice on first complete" isn't working correctly? Or
> perhaps we just need to be more extreme.

Dunno, I was just tossing rocks and sticks at it.

I don't really understand the reasoning behind overloading: I can see
that it allows cutting thicker slabs for the disk, but with the streaming
writer vs reader case, seems only the writers can do that.  The reader
is unlikely to be alone, is it?  Seems to me that either dd, a flusher
thread or kjournald is going to be there with it, which gives dd a huge
advantage: it has two proxies to help it squabble over disk, konsole
has none.

	-Mike

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 6:23 ` Mike Galbraith @ 2009-10-02 8:04 ` Jens Axboe 2009-10-02 8:04 ` Jens Axboe 1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 8:04 UTC (permalink / raw)
To: Mike Galbraith
Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Thu, 2009-10-01 at 20:58 +0200, Jens Axboe wrote:
> > On Thu, Oct 01 2009, Mike Galbraith wrote:
> > > > CIC_SEEK_THR is 8K jiffies, so that would be 8 seconds on a 1000 HZ
> > > > system. Try using one "slice_idle" period of 8 ms. But it might turn out
> > > > to be too short depending on the disk speed.
> > >
> > > Yeah, it is too short, as is even _400_ ms.  Trouble is, by the time
> > > some new task is determined to be seeky, the damage is already done.
> > >
> > > The below does better, though not as well as "just say no to overload"
> > > of course ;-)
> >
> > So this essentially takes the "avoid impact from previous slice" to a
> > new extreme, by idling even before dispatching requests from the new
> > queue. We basically do two things to prevent this already - one is to
> > only set the slice when the first request is actually serviced, and the
> > other is to drain async requests completely before starting sync ones.
> > I'm a bit surprised that the former doesn't solve the problem fully. I
> > guess what happens is that if the drive has been flooded with writes, it
> > may service the new read immediately and then return to finish emptying
> > its writeback cache. This will cause an impact for any sync IO until
> > that cache is flushed, and then cause that sync queue to not get as much
> > service as it should have.
>
> I did the stamping selection other than how long have we been solo based
> on these possibly wrong speculations:
>
> If we're in the idle window and doing the async drain thing, we're at
> the spot where Vivek's patch helps a ton.  Seemed like a great time to
> limit the size of any io that may land in front of my sync reader to
> plain "you are not alone" quantity.

You can't be in the idle window and doing async drain at the same time;
the idle window doesn't start until the sync queue has completed a
request. Hence my above rant on device interference.

> If we've got sync io in flight, that should mean that my new or old
> known seeky queue has been serviced at least once.  There's likely to be
> more on the way, so delay overloading then too.
>
> The seeky bit is supposed to be the earlier "last time we saw a seeker"
> thing, but known seeky is too late to help a new task at all unless you
> turn off the overloading for ages, so I added the if incalculable check
> for good measure, hoping that meant the task is new, may want to exec.
>
> Stamping any place may (see below) possibly limit the size of the io the
> reader can generate as well as writer, but I figured what's good for the
> goose is good for the gander, or it ain't really good.  The overload
> was causing the observed pain, definitely ain't good for both at these
> times at least, so don't let it do that.
>
> > Perhaps the "set slice on first complete" isn't working correctly? Or
> > perhaps we just need to be more extreme.
>
> Dunno, I was just tossing rocks and sticks at it.
>
> I don't really understand the reasoning behind overloading: I can see
> that it allows cutting thicker slabs for the disk, but with the streaming
> writer vs reader case, seems only the writers can do that.  The reader
> is unlikely to be alone, is it?  Seems to me that either dd, a flusher
> thread or kjournald is going to be there with it, which gives dd a huge
> advantage: it has two proxies to help it squabble over disk, konsole
> has none.

That is true, async queues have a huge advantage over sync ones. But
sync vs async is only part of it; any combination of queued sync, queued
sync random etc. has different ramifications for the behaviour of the
individual queue. It's not hard to make the latency good; the hard bit
is making sure we also perform well for all other scenarios.

--
Jens Axboe

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-10-02 8:04 ` Jens Axboe
  (?)
@ 2009-10-02 8:53 ` Mike Galbraith
  [not found] ` <1254473609.6378.24.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  ` (2 more replies)
  -1 siblings, 3 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 8:53 UTC (permalink / raw)
To: Jens Axboe
Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel

On Fri, 2009-10-02 at 10:04 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Mike Galbraith wrote:
> > If we're in the idle window and doing the async drain thing, we've at
> > the spot where Vivek's patch helps a ton. Seemed like a great time to
> > limit the size of any io that may land in front of my sync reader to
> > plain "you are not alone" quantity.
>
> You can't be in the idle window and doing async drain at the same time,
> the idle window doesn't start until the sync queue has completed a
> request. Hence my above rant on device interference.

I'll take your word for it.

        /*
         * Drain async requests before we start sync IO
         */
        if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])

Looked about the same to me as..

        enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);

..where Vivek prevented turning 1 into 0, so I stamped it ;-)

> > Dunno, I was just tossing rocks and sticks at it.
> >
> > I don't really understand the reasoning behind overloading: I can see
> > that allows cutting thicker slabs for the disk, but with the streaming
> > writer vs reader case, seems only the writers can do that. The reader
> > is unlikely to be alone isn't it? Seems to me that either dd, a flusher
> > thread or kjournald is going to be there with it, which gives dd a huge
> > advantage.. it has two proxies to help it squabble over disk, konsole
> > has none.
>
> That is true, async queues have a huge advantage over sync ones. But
> sync vs async is only part of it, any combination of queued sync, queued
> sync random etc have different ramifications on behaviour of the
> individual queue.
>
> It's not hard to make the latency good, the hard bit is making sure we
> also perform well for all other scenarios.

Yeah, that's why I'm trying to be careful about what I say, I know full
well this ain't easy to get right. I'm not even thinking of submitting
anything, it's just diagnostic testing.

WRT my who can overload theory, I instrumented for my own edification.

Overload totally forbidden, stamps ergo disabled.

fairness=0  11.3 avg  (ie == virgin source)
fairness=1   2.8 avg

Back to virgin settings, instrument who is overloading during sequences of..

        echo 2 > /proc/sys/vm/drop_caches
        sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE

..with dd continually running. 1 second counts for above.

...
[  916.585880] od_sync: 0 od_async: 87 reject_sync: 0 reject_async: 37
[  917.662585] od_sync: 0 od_async: 126 reject_sync: 0 reject_async: 53
[  918.732872] od_sync: 0 od_async: 96 reject_sync: 0 reject_async: 22
[  919.743730] od_sync: 0 od_async: 75 reject_sync: 0 reject_async: 15
[  920.914549] od_sync: 0 od_async: 81 reject_sync: 0 reject_async: 17
[  921.988198] od_sync: 0 od_async: 123 reject_sync: 0 reject_async: 30
...minutes long

(reject == cfqq->dispatched >= 4 * max_dispatch)

Doing the same with firefox, I did see the burst below one time, dunno
what triggered that. I watched 6 runs, and only saw such a burst once.
Typically, numbers are the same as konsole, with a very rare 4 or 5 for
sync sneaking in.

[ 1988.177758] od_sync: 0 od_async: 104 reject_sync: 0 reject_async: 48
[ 1992.291779] od_sync: 19 od_async: 83 reject_sync: 0 reject_async: 82
[ 1993.300850] od_sync: 79 od_async: 0 reject_sync: 28 reject_async: 0
[ 1994.313327] od_sync: 147 od_async: 104 reject_sync: 90 reject_async: 16
[ 1995.378025] od_sync: 14 od_async: 45 reject_sync: 0 reject_async: 2
[ 1996.456871] od_sync: 15 od_async: 74 reject_sync: 1 reject_async: 7
[ 1997.611226] od_sync: 0 od_async: 84 reject_sync: 0 reject_async: 14

Never noticed a sync overload watching a make -j4 for a couple minutes.

^ permalink raw reply [flat|nested] 349+ messages in thread
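For readers trying to picture the instrumentation behind those log lines, here is a rough sketch in C of how such counters could be kept. It is not Mike's patch, which is not shown in the thread. The struct od_stats type and the od_account() helper are made up for illustration. The only rule taken from the message is the reject cut-off, dispatched >= 4 * max_dispatch; treating "od" as "already at or past max_dispatch but still allowed" is an assumption.

```c
#include <stdbool.h>

/* Per-interval counters, named after the fields in the log lines. */
struct od_stats {
	unsigned int od_sync, od_async;
	unsigned int reject_sync, reject_async;
};

/*
 * Account one dispatch attempt and report whether it may proceed.
 * Assumption: "od" means the queue is already at or past max_dispatch;
 * "reject" is the quoted cut-off, dispatched >= 4 * max_dispatch.
 */
static bool od_account(struct od_stats *s, bool sync,
		       unsigned int dispatched, unsigned int max_dispatch)
{
	if (dispatched < max_dispatch)
		return true;	/* within quantum, nothing to count */

	if (dispatched >= 4 * max_dispatch) {
		if (sync)
			s->reject_sync++;
		else
			s->reject_async++;
		return false;
	}

	if (sync)
		s->od_sync++;
	else
		s->od_async++;
	return true;
}
```

In this model the od and reject counts are disjoint. The real counters may well have been nested, so the sketch should only be read as a picture of the threshold, not of the exact numbers above.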
[parent not found: <1254473609.6378.24.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>]
* Re: IO scheduler based IO controller V10
  [not found] ` <1254473609.6378.24.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
@ 2009-10-02 9:00 ` Mike Galbraith
  2009-10-02 9:55 ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 9:00 UTC (permalink / raw)
To: Jens Axboe
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

> WRT my who can overload theory, I instrumented for my own edification.
>
> Overload totally forbidden, stamps ergo disabled.
>
> fairness=0 11.3 avg (ie == virgin source)
> fairness=1 2.8 avg

(oops, quantum was set to 16 as well there. not that it matters, but
for completeness)

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  [not found] ` <1254473609.6378.24.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-10-02 9:00 ` Mike Galbraith
@ 2009-10-02 9:55 ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 9:55 UTC (permalink / raw)
To: Mike Galbraith
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 10:04 +0200, Jens Axboe wrote:
> > On Fri, Oct 02 2009, Mike Galbraith wrote:
>
> > > If we're in the idle window and doing the async drain thing, we've at
> > > the spot where Vivek's patch helps a ton. Seemed like a great time to
> > > limit the size of any io that may land in front of my sync reader to
> > > plain "you are not alone" quantity.
> >
> > You can't be in the idle window and doing async drain at the same time,
> > the idle window doesn't start until the sync queue has completed a
> > request. Hence my above rant on device interference.
>
> I'll take your word for it.
>
>         /*
>          * Drain async requests before we start sync IO
>          */
>         if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
>
> Looked about the same to me as..
>
>         enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>
> ..where Vivek prevented turning 1 into 0, so I stamped it ;-)

cfq_cfqq_idle_window(cfqq) just tells you whether this queue may enter
idling, not that it is currently idling. The actual idling happens from
cfq_completed_request(), here:

        else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
                 sync && !rq_noidle(rq))
                cfq_arm_slice_timer(cfqd);

and after that the queue will be marked as waiting, so
cfq_cfqq_wait_request(cfqq) is a better indication of whether we are
currently waiting for a request (idling) or not.

> > > Dunno, I was just tossing rocks and sticks at it.
> > >
> > > I don't really understand the reasoning behind overloading: I can see
> > > that allows cutting thicker slabs for the disk, but with the streaming
> > > writer vs reader case, seems only the writers can do that. The reader
> > > is unlikely to be alone isn't it? Seems to me that either dd, a flusher
> > > thread or kjournald is going to be there with it, which gives dd a huge
> > > advantage.. it has two proxies to help it squabble over disk, konsole
> > > has none.
> >
> > That is true, async queues have a huge advantage over sync ones. But
> > sync vs async is only part of it, any combination of queued sync, queued
> > sync random etc have different ramifications on behaviour of the
> > individual queue.
> >
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
>
> Yeah, that's why I'm trying to be careful about what I say, I know full
> well this ain't easy to get right. I'm not even thinking of submitting
> anything, it's just diagnostic testing.

It's much appreciated btw, if we can make this better without killing
throughput, then I'm surely interested in picking up your interesting
bits and getting them massaged into something we can include. So don't
be discouraged, I'm just being realistic :-)

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 349+ messages in thread
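The difference between "may idle" and "is idling now" can be pictured with a small toy model in C. This is a sketch for illustration only, not the CFQ code: struct toy_queue, the TOY_FLAG_* bits and toy_completed_request() are invented names, and the cfq_close_cooperator() and rq_noidle() conditions from the snippet above are reduced to a single boolean parameter.

```c
#include <stdbool.h>

/* Hypothetical flag bits, standing in for CFQ's per-queue flags. */
enum {
	TOY_FLAG_IDLE_WINDOW  = 1 << 0,	/* queue is allowed to idle */
	TOY_FLAG_WAIT_REQUEST = 1 << 1,	/* idle timer armed, waiting now */
};

struct toy_queue {
	unsigned int flags;
	unsigned int queued;	/* requests still pending in the queue */
};

/* "May this queue enter idling?" (the cfq_cfqq_idle_window() question) */
static bool toy_idle_window(const struct toy_queue *q)
{
	return (q->flags & TOY_FLAG_IDLE_WINDOW) != 0;
}

/* "Is this queue idling right now?" (the cfq_cfqq_wait_request() question) */
static bool toy_wait_request(const struct toy_queue *q)
{
	return (q->flags & TOY_FLAG_WAIT_REQUEST) != 0;
}

/*
 * Called when a request completes. Idling only starts here, and only if
 * the queue ran empty, the request was sync, and idling is permitted.
 */
static void toy_completed_request(struct toy_queue *q, bool sync, bool noidle)
{
	if (q->queued == 0 && sync && !noidle && toy_idle_window(q))
		q->flags |= TOY_FLAG_WAIT_REQUEST;
}
```

A queue created with only TOY_FLAG_IDLE_WINDOW set answers yes to the first question and no to the second until a sync request completes on it, which is the state the async drain check sees.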
* Re: IO scheduler based IO controller V10
  2009-10-02 9:55 ` Jens Axboe
@ 2009-10-02 12:22 ` Mike Galbraith
  [not found] ` <20091002095555.GB26962-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-02 12:22 UTC (permalink / raw)
To: Jens Axboe
Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel

On Fri, 2009-10-02 at 11:55 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Mike Galbraith wrote:
> >
> >         /*
> >          * Drain async requests before we start sync IO
> >          */
> >         if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> >
> > Looked about the same to me as..
> >
> >         enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
> >
> > ..where Vivek prevented turning 1 into 0, so I stamped it ;-)
>
> cfq_cfqq_idle_window(cfqq) just tells you whether this queue may enter
> idling, not that it is currently idling. The actual idling happens from
> cfq_completed_request(), here:
>
>         else if (cfqq_empty && !cfq_close_cooperator(cfqd, cfqq, 1) &&
>                  sync && !rq_noidle(rq))
>                 cfq_arm_slice_timer(cfqd);
>
> and after that the queue will be marked as waiting, so
> cfq_cfqq_wait_request(cfqq) is a better indication of whether we are
> currently waiting for a request (idling) or not.

Hm. Then cfq_cfqq_idle_window(cfqq) actually suits my intent better.

(If I want to reduce async's advantage, I should target specifically, ie
only stamp if this queue is a sync queue....otoh, if this queue is sync,
it is now officially too late, whereas if this queue is dd about to
inflict the wrath of kjournald on my reader's world, stamping now is a
really good idea.. scritch scritch scritch <smoke>)

I'll go tinker with it. Thanks for the clue.

	-Mike

^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091002080417.GG14918-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>]
* Re: IO scheduler based IO controller V10
  [not found] ` <20091002080417.GG14918-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>
  2009-10-02 8:53 ` Mike Galbraith
@ 2009-10-02 9:24 ` Ingo Molnar
  1 sibling, 0 replies; 349+ messages in thread
From: Ingo Molnar @ 2009-10-02 9:24 UTC (permalink / raw)
To: Jens Axboe
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

* Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:

> It's not hard to make the latency good, the hard bit is making sure we
> also perform well for all other scenarios.

Looking at the numbers from Mike:

 | dd competing against perf stat -- konsole -e exec timings, 5 back to
 | back runs
 |                                                          Avg
 | before          9.15   14.51    9.39   15.06    9.90    11.6
 | after [+patch]  1.76    1.54    1.93    1.88    1.56     1.7

_PLEASE_ make read latencies this good - the numbers are _vastly_
better. We'll worry about the 'other' things _after_ we've reached good
latencies.

I thought this principle was a well established basic rule of Linux IO
scheduling. Why do we have to have a 'latency vs. bandwidth' discussion
again and again? I thought latency won hands down.

	Ingo

^ permalink raw reply [flat|nested] 349+ messages in thread
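The Avg column in the quoted table can be rechecked with a few lines of C. This is only a worked-arithmetic sketch: mean() and speedup() are illustrative helpers, and the two arrays are the ten timings copied from the table as quoted.

```c
#include <stddef.h>

/* Arithmetic mean of n samples; returns 0.0 for an empty set. */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return n ? sum / (double)n : 0.0;
}

/* How many times faster "after" is than "before" (2.0 means twice as fast). */
static double speedup(double before, double after)
{
	return before / after;
}

/* Timings copied from the quoted table, five back-to-back runs each. */
static const double konsole_before[] = { 9.15, 14.51, 9.39, 15.06, 9.90 };
static const double konsole_after[]  = { 1.76, 1.54, 1.93, 1.88, 1.56 };
```

mean(konsole_before, 5) comes to roughly 11.6 and mean(konsole_after, 5) to roughly 1.7, matching the Avg column, and the ratio of the two averages is a bit under 7.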
* Re: IO scheduler based IO controller V10
  2009-10-02 9:24 ` Ingo Molnar
@ 2009-10-02 9:28 ` Jens Axboe
  -1 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-02 9:28 UTC (permalink / raw)
To: Ingo Molnar
Cc: Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Fri, Oct 02 2009, Ingo Molnar wrote:
>
> * Jens Axboe <jens.axboe@oracle.com> wrote:
>
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
>
> Looking at the numbers from Mike:
>
>  | dd competing against perf stat -- konsole -e exec timings, 5 back to
>  | back runs
>  |                                                          Avg
>  | before          9.15   14.51    9.39   15.06    9.90    11.6
>  | after [+patch]  1.76    1.54    1.93    1.88    1.56     1.7
>
> _PLEASE_ make read latencies this good - the numbers are _vastly_
> better. We'll worry about the 'other' things _after_ we've reached good
> latencies.
>
> I thought this principle was a well established basic rule of Linux IO
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion
> again and again? I thought latency won hands down.

It's really not that simple, if we go and do easy latency bits, then
throughput drops 30% or more. You can't say it's black and white latency
vs throughput issue, that's just not how the real world works. The
server folks would be most unpleased.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 @ 2009-10-02 14:24 ` Linus Torvalds 0 siblings, 0 replies; 349+ messages in thread From: Linus Torvalds @ 2009-10-02 14:24 UTC (permalink / raw) To: Jens Axboe Cc: Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, 2 Oct 2009, Jens Axboe wrote: > > It's really not that simple, if we go and do easy latency bits, then > throughput drops 30% or more. Well, if we're talking 500-950% improvement vs 30% deprovement, I think it's pretty clear, though. Even the server people do care about latencies. Often they care quite a bit, in fact. And Mike's patch didn't look big or complicated. > You can't say it's black and white latency vs throughput issue, Umm. Almost 1000% vs 30%. Forget latency vs throughput. That's pretty damn black-and-white _regardless_ of what you're measuring. Plus you probably made up the 30% - have you tested the patch? And quite frankly, we get a _lot_ of complaints about latency. A LOT. It's just harder to measure, so people seldom attach numbers to it. But that again means that when people _are_ able to attach numbers to it, we should take those numbers _more_ seriously rather than less. So the 30% you threw out as a number is pretty much worthless. Linus ^ permalink raw reply [flat|nested] 349+ messages in thread
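For reference, the "500-950%" range can be roughly reproduced from the five konsole timings Mike posted, which are quoted earlier in this thread. This is only a sketch of the arithmetic: pairing the runs by position is an assumption made for illustration (the runs are independent), and the variable names are not from the thread.

```python
# Mike's konsole startup timings in seconds, as quoted in this thread.
before = [9.15, 14.51, 9.39, 15.06, 9.90]
after = [1.76, 1.54, 1.93, 1.88, 1.56]


def mean(xs):
    return sum(xs) / len(xs)


# Per-run ratio of "before" to "after", expressed as a percentage.
# Pairing runs by position is an illustrative assumption.
ratios_pct = [100.0 * b / a for b, a in zip(before, after)]

# Ratio of the averages, and the share of startup time that went away.
mean_ratio = mean(before) / mean(after)
reduction_pct = 100.0 * (1.0 - mean(after) / mean(before))

print("per-run before/after (%):", [round(r) for r in ratios_pct])
print("mean before/after: %.1fx" % mean_ratio)
print("startup time removed: %.0f%%" % reduction_pct)
```

The per-run ratios span a little under 500% to a little under 950%, which is where the quoted range appears to come from. Note that the 30% figure it is being compared against was, by Jens' own account later in the thread, not a measurement.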
* Re: IO scheduler based IO controller V10 2009-10-02 14:24 ` Linus Torvalds (?) @ 2009-10-02 14:45 ` Mike Galbraith 2009-10-02 14:57 ` Jens Axboe [not found] ` <1254494742.7307.37.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> -1 siblings, 2 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-02 14:45 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Ingo Molnar, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, 2009-10-02 at 07:24 -0700, Linus Torvalds wrote: > > On Fri, 2 Oct 2009, Jens Axboe wrote: > > > > It's really not that simple, if we go and do easy latency bits, then > > throughput drops 30% or more. > > Well, if we're talking 500-950% improvement vs 30% deprovement, I think > it's pretty clear, though. Even the server people do care about latencies. > > Often they care quite a bit, in fact. > > And Mike's patch didn't look big or complicated. But it is a hack. (thought about and measured, but hack nonetheless) I haven't tested it on much other than reader vs streaming writer. It may well destroy the rest of the IO universe. I don't have the hw to even test any hairy chested IO. -Mike ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 14:45 ` Mike Galbraith @ 2009-10-02 14:57 ` Jens Axboe [not found] ` <1254494742.7307.37.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 1 sibling, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 14:57 UTC (permalink / raw) To: Mike Galbraith Cc: Linus Torvalds, Ingo Molnar, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, Oct 02 2009, Mike Galbraith wrote: > On Fri, 2009-10-02 at 07:24 -0700, Linus Torvalds wrote: > > > > On Fri, 2 Oct 2009, Jens Axboe wrote: > > > > > > It's really not that simple, if we go and do easy latency bits, then > > > throughput drops 30% or more. > > > > Well, if we're talking 500-950% improvement vs 30% deprovement, I think > > it's pretty clear, though. Even the server people do care about latencies. > > > > Often they care quite a bit, in fact. > > > > And Mike's patch didn't look big or complicated. > > But it is a hack. (thought about and measured, but hack nonetheless) > > I haven't tested it on much other than reader vs streaming writer. It > may well destroy the rest of the IO universe. I don't have the hw to > even test any hairy chested IO. I'll get a desktop box going on this too. The plan is to make the latency as good as we can without making too many stupid decisions in the io scheduler, then we can care about the throughput later. Rinse and repeat. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 14:24 ` Linus Torvalds @ 2009-10-02 14:56 ` Jens Axboe -1 siblings, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 14:56 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, Oct 02 2009, Linus Torvalds wrote: > > > On Fri, 2 Oct 2009, Jens Axboe wrote: > > > > It's really not that simple, if we go and do easy latency bits, then > > throughput drops 30% or more. > > Well, if we're talking 500-950% improvement vs 30% deprovement, I think > it's pretty clear, though. Even the server people do care about latencies. > > Often they care quite a bit, in fact. Mostly they care about throughput, and when they come running because some of their favorite app/benchmark/etc is now 2% slower, I get to hear about it all the time. So yes, latency is not ignored, but mostly they yack about throughput. > And Mike's patch didn't look big or complicated. It wasn't, it was more of a hack than something mergeable though (and I think Mike will agree on that). So I'll repeat what I said to Mike, I'm very well prepared to get something worked out and merged and I very much appreciate the work he's putting into this. > > You can't say it's black and white latency vs throughput issue, > > Umm. Almost 1000% vs 30%. Forget latency vs throughput. That's pretty damn > black-and-white _regardless_ of what you're measuring. Plus you probably > made up the 30% - have you tested the patch? The 30% is totally made up, it's based on previous latency vs throughput tradeoffs. I haven't tested Mike's patch. > And quite frankly, we get a _lot_ of complaints about latency. A LOT. It's > just harder to measure, so people seldom attach numbers to it. 
But that > again means that when people _are_ able to attach numbers to it, we should > take those numbers _more_ seriously rather than less. I agree, we can easily make CFQ be very much about latency. If you think that is fine, then let's just do that. Then we'll get to fix the server side up when the next RHEL/SLES/whatever cycle is honing in on a kernel, hopefully we won't have to start over when that happens. > So the 30% you threw out as a number is pretty much worthless. It's hand waving, definitely. But I've been doing io scheduler tweaking for years, and I know how hard it is to balance. If you want latency, then you basically only ever give the device 1 thing to do. And you let things cool down before switching over. If you do that, then your nice big array of SSDs or rotating drives will easily drop to 1/4th of the original performance. So we try and tweak the logic to make everybody happy. In some cases I wish we had a server vs desktop switch, since it would make decisions on this easier. I know you say that servers care about latency, but not at all to the extent that desktops do. Most desktop users would gladly give away the top of the performance for latency, that's not true of most server users. Depends on what the server does, of course. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 14:56 ` Jens Axboe @ 2009-10-02 15:14 ` Linus Torvalds -1 siblings, 0 replies; 349+ messages in thread From: Linus Torvalds @ 2009-10-02 15:14 UTC (permalink / raw) To: Jens Axboe Cc: Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, 2 Oct 2009, Jens Axboe wrote: > > Mostly they care about throughput, and when they come running because > some of their favorite app/benchmark/etc is now 2% slower, I get to hear > about it all the time. So yes, latency is not ignored, but mostly they > yack about throughput. The reason they yack about it is that they can measure it. Give them the benchmark where it goes the other way, and tell them why they see a 2% deprovement. Give them some button they can tweak, because they will. But make the default be low-latency. Because everybody cares about low latency, and the people who do so are _not_ the people who you give buttons to tweak things with. > I agree, we can easily make CFQ be very much about latency. If you > think that is fine, then let's just do that. Then we'll get to fix the > server side up when the next RHEL/SLES/whatever cycle is honing in on a > kernel, hopefully we won't have to start over when that happens. I really think we should do latency first, and throughput second. It's _easy_ to get throughput. The people who care just about throughput can always just disable all the work we do for latency. If they really care about just throughput, they won't want fairness either - none of that complex stuff. Linus ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 15:14 ` Linus Torvalds @ 2009-10-02 16:01 ` jim owens -1 siblings, 0 replies; 349+ messages in thread From: jim owens @ 2009-10-02 16:01 UTC (permalink / raw) To: Linus Torvalds Cc: Jens Axboe, Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel Linus Torvalds wrote: > > I really think we should do latency first, and throughput second. Agree. > It's _easy_ to get throughput. The people who care just about throughput > can always just disable all the work we do for latency. But in my experience it is not that simple... The argument latency vs throughput or desktop vs server is wrong. I/O can never keep up with the ability of CPUs to dirty data. On desktops and servers (really many-user-desktops) we want minimum latency but the enemy is dirty VM. If we ignore the need for throughput to flush dirty pages, VM gets angry and forced VM page cleaning I/O is bad I/O. We want min latency with low dirty page percent but need to switch to max write throughput at some high dirty page percent. We can not prevent the cliff we fall off where the system chokes because the dirty page load is too high, but if we only worry about latency, we bring that choke point cliff in so it happens with a lower load. A 10% lower overload point might be fine to get 100% better latency, but would desktop users accept a 50% lower overload point where running one more application makes the system appear hung? Even desktop users commonly measure "how much work can I do before the system becomes unresponsive". jim ^ permalink raw reply [flat|nested] 349+ messages in thread
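A minimal sketch of the switch jim describes: favour latency while the dirty page percentage is low, favour write throughput once it is high. Everything concrete here is an assumption made for illustration. The two thresholds, the mode names and the function names are hypothetical, and the use of two thresholds (hysteresis) instead of a single cut-off is an addition of this sketch, not something stated in the message.

```python
# Hypothetical thresholds (percent of memory that is dirty); not kernel values.
LOW_WATER = 10.0   # at or below this, go back to favouring latency
HIGH_WATER = 30.0  # at or above this, favour write throughput


def next_mode(mode, dirty_pct):
    """Return "latency" or "throughput" for the given dirty-page percentage.

    Two thresholds give hysteresis, so the mode does not flap when the
    dirty percentage hovers around a single cut-off.
    """
    if dirty_pct >= HIGH_WATER:
        return "throughput"
    if dirty_pct <= LOW_WATER:
        return "latency"
    return mode  # in between: keep doing whatever we were doing


def trace(samples, mode="latency"):
    """Apply next_mode over a sequence of dirty-page percentages."""
    modes = []
    for pct in samples:
        mode = next_mode(mode, pct)
        modes.append(mode)
    return modes
```

For example, `trace([5, 20, 35, 20, 8])` stays in latency mode through the first two samples, switches to throughput at 35, holds it at 20, and drops back at 8. Where the thresholds should sit is exactly the open question in the message: set the high mark too high and the overload cliff arrives sooner.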
* Re: IO scheduler based IO controller V10 2009-10-02 15:14 ` Linus Torvalds @ 2009-10-02 17:11 ` Jens Axboe -1 siblings, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 17:11 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, Oct 02 2009, Linus Torvalds wrote: > > > On Fri, 2 Oct 2009, Jens Axboe wrote: > > > > Mostly they care about throughput, and when they come running because > > some of their favorite app/benchmark/etc is now 2% slower, I get to hear > > about it all the time. So yes, latency is not ignored, but mostly they > > yack about throughput. > > The reason they yack about it is that they can measure it. > > Give them the benchmark where it goes the other way, and tell them why > they see a 2% deprovement. Give them some button they can tweak, because > they will. To some extent that's true, and I didn't want to generalize. If they are adamant that the benchmark models their real life, then no amount of pointing in the other direction will change that. Your point about tuning is definitely true, these people are used to tuning things. For the desktop we care a lot more about working out of the box. > But make the default be low-latency. Because everybody cares about low > latency, and the people who do so are _not_ the people who you give > buttons to tweak things with. Totally agree. > > I agree, we can easily make CFQ be very much about latency. If you > > think that is fine, then let's just do that. Then we'll get to fix the > > server side up when the next RHEL/SLES/whatever cycle is honing in on a > > kernel, hopefully we won't have to start over when that happens. > > I really think we should do latency first, and throughput second. > > It's _easy_ to get throughput. 
The people who care just about throughput > can always just disable all the work we do for latency. If they really > care about just throughput, they won't want fairness either - none of that > complex stuff. It's not _that_ easy, it depends a lot on the access patterns. A good example of that is actually the idling that we already do. Say you have two applications, each starting up. If you start them both at the same time and just care for the dumb low latency, then you'll do one IO from each of them in turn. Latency will be good, but throughput will be awful. And this means that in 20s they are both started, while with the slice idling and priority disk access that CFQ does, you'd hopefully have both up and running in 2s. So latency is good, definitely, but sometimes you have to worry about the bigger picture too. Latency is more than single IOs, it's often for a complete operation which may involve lots of IOs. Single IO latency is a benchmark thing, it's not a real life issue. And that's where it becomes complex and not so black and white. Mike's test is a really good example of that. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
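The two-applications example can be put into a toy model. This is a back-of-the-envelope sketch, not how CFQ accounts for time: it assumes each application reads its own contiguous region of a rotating disk, that moving between the two regions costs one full seek, and that inside a slice the disk sits idle for a short think time after every request. All function names and parameters are hypothetical.

```python
import math


def interleaved_ms(n_ios, seek_ms, xfer_ms):
    """Two apps, strict alternation: every I/O pays a seek to the other region."""
    return 2 * n_ios * (seek_ms + xfer_ms)


def sliced_ms(n_ios, slice_ios, seek_ms, xfer_ms, think_ms):
    """Two apps served in slices of slice_ios requests each.

    One seek per slice; inside a slice the disk idles think_ms after each
    request, waiting for the same app to issue its next nearby request.
    """
    slices = 2 * math.ceil(n_ios / slice_ios)
    return slices * seek_ms + 2 * n_ios * (xfer_ms + think_ms)


def slice_wait_ms(slice_ios, seek_ms, xfer_ms, think_ms):
    """How long the other app waits while one full slice is served."""
    return seek_ms + slice_ios * (xfer_ms + think_ms)
```

With assumed values of 1000 reads per application, an 8 ms seek, 0.1 ms per transfer, 0.5 ms think time and 50 requests per slice, `interleaved_ms(1000, 8.0, 0.1)` comes to about 16 s while `sliced_ms(1000, 50, 8.0, 0.1, 0.5)` comes to about 1.5 s, the same order of magnitude as the 20s versus 2s in the message. The price shows up in `slice_wait_ms(50, 8.0, 0.1, 0.5)`: a request from the other application can wait about 38 ms instead of about 8 ms, which is the single-IO latency being traded away.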
* Re: IO scheduler based IO controller V10 2009-10-02 17:11 ` Jens Axboe @ 2009-10-02 17:20 ` Ingo Molnar -1 siblings, 0 replies; 349+ messages in thread From: Ingo Molnar @ 2009-10-02 17:20 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel * Jens Axboe <jens.axboe@oracle.com> wrote: > It's not _that_ easy, it depends a lot on the access patterns. A good > example of that is actually the idling that we already do. Say you > have two applications, each starting up. If you start them both at the > same time and just care for the dumb low latency, then you'll do one > IO from each of them in turn. Latency will be good, but throughput > will be awful. And this means that in 20s they are both started, > while with the slice idling and priority disk access that CFQ does, > you'd hopefully have both up and running in 2s. > > So latency is good, definitely, but sometimes you have to worry about > the bigger picture too. Latency is more than single IOs, it's often > for a complete operation which may involve lots of IOs. Single IO > latency is a benchmark thing, it's not a real life issue. And that's > where it becomes complex and not so black and white. Mike's test is a > really good example of that. To the extent that you are arguing that Mike's test is artificial (i'm not sure you are arguing that) - Mike certainly did not do an artificial test - he tested 'konsole' cache-cold startup latency, such as: sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE against a streaming dd. That is a _very_ relevant benchmark IMHO and konsole's cache footprint is far from trivial. (In fact i'd argue it's one of the most important IO benchmarks on a desktop system - how does your desktop hold up to something doing streaming IO.) 
Ingo ^ permalink raw reply [flat|nested] 349+ messages in thread
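A small harness for repeating that kind of measurement might look like the following. It is a sketch under stated assumptions: it times wall-clock duration from Python instead of parsing `perf stat` output, it does not make the runs cache-cold, and the function names `time_runs` and `summarize` are hypothetical. Starting the competing streaming `dd` is left to the person running the test.

```python
import statistics
import subprocess
import time


def time_runs(cmd, runs=5):
    """Run cmd `runs` times back to back; return wall-clock seconds per run."""
    timings = []
    for _ in range(runs):
        start = time.monotonic()
        # check=True raises if the command fails, so a broken run is not
        # silently counted as a fast one.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings.append(time.monotonic() - start)
    return timings


def summarize(timings):
    """Format timings like the table in this thread: each run, then the average."""
    runs = " ".join("%.2f" % t for t in timings)
    return "%s  avg %.1f" % (runs, statistics.mean(timings))
```

Usage would be along the lines of `print(summarize(time_runs(["konsole", "-e", "exit"])))` while the streaming writer is active, once with and once without the patch under test.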
* Re: IO scheduler based IO controller V10 2009-10-02 17:20 ` Ingo Molnar @ 2009-10-02 17:25 ` Jens Axboe -1 siblings, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 17:25 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, Oct 02 2009, Ingo Molnar wrote: > > * Jens Axboe <jens.axboe@oracle.com> wrote: > > > It's not _that_ easy, it depends a lot on the access patterns. A good > > example of that is actually the idling that we already do. Say you > > have two applications, each starting up. If you start them both at the > > same time and just care for the dumb low latency, then you'll do one > > IO from each of them in turn. Latency will be good, but throughput > > will be aweful. And this means that in 20s they are both started, > > while with the slice idling and priority disk access that CFQ does, > > you'd hopefully have both up and running in 2s. > > > > So latency is good, definitely, but sometimes you have to worry about > > the bigger picture too. Latency is more than single IOs, it's often > > for complete operation which may involve lots of IOs. Single IO > > latency is a benchmark thing, it's not a real life issue. And that's > > where it becomes complex and not so black and white. Mike's test is a > > really good example of that. > > To the extent of you arguing that Mike's test is artificial (i'm not > sure you are arguing that) - Mike certainly did not do an artificial > test - he tested 'konsole' cache-cold startup latency, such as: [snip] I was saying the exact opposite, that Mike's test is a good example of a valid test. It's not measuring single IO latencies, it's doing a sequence of valid events and looking at the latency for those. 
It's benchmarking the bigger picture, not a microbenchmark. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
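As an aside, the two-application example quoted above (about 20s when the IOs are strictly interleaved, about 2s with slice idling) can be reproduced in order of magnitude with a toy model. This is only a sketch under stated assumptions, not how CFQ is implemented: the function names are hypothetical, the IO count and the 1 ms transfer time are invented, the 8 ms seek cost simply echoes the access time mentioned earlier in the thread, and the application's think time between IOs is assumed to be negligible.

```python
# Toy model of two applications starting at the same time on one rotating disk.
# All names and numbers are illustrative assumptions, not measurements.

def interleaved_finish_ms(n_ios, seek_ms, xfer_ms):
    """Strict alternation A, B, A, B, ...: every IO pays a seek plus a transfer.

    Returns (finish time of A, finish time of B) in milliseconds.
    """
    per_io = seek_ms + xfer_ms
    total = 2 * n_ios * per_io
    # A issues the second-to-last IO overall, B issues the last one.
    return total - per_io, total


def idling_finish_ms(n_ios, seek_ms, xfer_ms):
    """The scheduler idles for the active application, so its IOs run back to
    back: one seek to reach its data, then n_ios transfers. B runs after A.

    Returns (finish time of A, finish time of B) in milliseconds.
    """
    one_app = seek_ms + n_ios * xfer_ms
    return one_app, 2 * one_app


if __name__ == "__main__":
    N_IOS, SEEK_MS, XFER_MS = 1000, 8, 1  # assumed workload and disk figures
    a, b = interleaved_finish_ms(N_IOS, SEEK_MS, XFER_MS)
    print(f"interleaved: A done at {a / 1000:.1f}s, B done at {b / 1000:.1f}s")
    a, b = idling_finish_ms(N_IOS, SEEK_MS, XFER_MS)
    print(f"idling:      A done at {a / 1000:.1f}s, B done at {b / 1000:.1f}s")
```

With these assumed figures both applications need roughly 18s when interleaved, while with idling the first is up after about 1s and the second after about 2s. Single-IO latency is better in the interleaved case, yet the latency of the complete operation is far worse, which is the distinction being drawn in the message above.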
* Re: IO scheduler based IO controller V10 2009-10-02 17:25 ` Jens Axboe @ 2009-10-02 17:28 ` Ingo Molnar -1 siblings, 0 replies; 349+ messages in thread From: Ingo Molnar @ 2009-10-02 17:28 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel * Jens Axboe <jens.axboe@oracle.com> wrote: > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > * Jens Axboe <jens.axboe@oracle.com> wrote: > > > > > It's not _that_ easy, it depends a lot on the access patterns. A > > > good example of that is actually the idling that we already do. > > > Say you have two applications, each starting up. If you start them > > > both at the same time and just care for the dumb low latency, then > > > you'll do one IO from each of them in turn. Latency will be good, > > > but throughput will be aweful. And this means that in 20s they are > > > both started, while with the slice idling and priority disk access > > > that CFQ does, you'd hopefully have both up and running in 2s. > > > > > > So latency is good, definitely, but sometimes you have to worry > > > about the bigger picture too. Latency is more than single IOs, > > > it's often for complete operation which may involve lots of IOs. > > > Single IO latency is a benchmark thing, it's not a real life > > > issue. And that's where it becomes complex and not so black and > > > white. Mike's test is a really good example of that. > > > > To the extent of you arguing that Mike's test is artificial (i'm not > > sure you are arguing that) - Mike certainly did not do an artificial > > test - he tested 'konsole' cache-cold startup latency, such as: > > [snip] > > I was saying the exact opposite, that Mike's test is a good example of > a valid test. 
It's not measuring single IO latencies, it's doing a > sequence of valid events and looking at the latency for those. It's > benchmarking the bigger picture, not a microbenchmark. Good, so we are in violent agreement :-) Ingo ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091002172842.GA4884-X9Un+BFzKDI@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091002172842.GA4884-X9Un+BFzKDI@public.gmane.org> @ 2009-10-02 17:37 ` Jens Axboe 0 siblings, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 17:37 UTC (permalink / raw) To: Ingo Molnar Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds On Fri, Oct 02 2009, Ingo Molnar wrote: > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > > It's not _that_ easy, it depends a lot on the access patterns. A > > > > good example of that is actually the idling that we already do. > > > > Say you have two applications, each starting up. If you start them > > > > both at the same time and just care for the dumb low latency, then > > > > you'll do one IO from each of them in turn. Latency will be good, > > > > but throughput will be aweful. And this means that in 20s they are > > > > both started, while with the slice idling and priority disk access > > > > that CFQ does, you'd hopefully have both up and running in 2s. > > > > > > > > So latency is good, definitely, but sometimes you have to worry > > > > about the bigger picture too. Latency is more than single IOs, > > > > it's often for complete operation which may involve lots of IOs. > > > > Single IO latency is a benchmark thing, it's not a real life > > > > issue. 
And that's where it becomes complex and not so black and > > > > white. Mike's test is a really good example of that. > > > > > > To the extent of you arguing that Mike's test is artificial (i'm not > > > sure you are arguing that) - Mike certainly did not do an artificial > > > test - he tested 'konsole' cache-cold startup latency, such as: > > > > [snip] > > > > I was saying the exact opposite, that Mike's test is a good example of > > a valid test. It's not measuring single IO latencies, it's doing a > > sequence of valid events and looking at the latency for those. It's > > benchmarking the bigger picture, not a microbenchmark. > > Good, so we are in violent agreement :-) Yes, perhaps that last sentence didn't provide enough evidence of which category I put Mike's test into :-) So to kick things off, I added an 'interactive' knob to CFQ and defaulted it to on, along with re-enabling slice idling for hardware that does tagged command queuing. This is almost completely identical to what Vivek Goyal originally posted, it's just combined into one and uses the term 'interactive' instead of 'fairness'. I think the former is a better umbrella under which to add further tweaks that may sacrifice throughput slightly, in the quest for better latency. It's queued up in the for-linus branch. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091002173732.GK31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091002173732.GK31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> @ 2009-10-02 17:56 ` Ingo Molnar 2009-10-02 18:13 ` Mike Galbraith 1 sibling, 0 replies; 349+ messages in thread From: Ingo Molnar @ 2009-10-02 17:56 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > > > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > > > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > > > > It's not _that_ easy, it depends a lot on the access patterns. A > > > > > good example of that is actually the idling that we already do. > > > > > Say you have two applications, each starting up. If you start them > > > > > both at the same time and just care for the dumb low latency, then > > > > > you'll do one IO from each of them in turn. Latency will be good, > > > > > but throughput will be aweful. And this means that in 20s they are > > > > > both started, while with the slice idling and priority disk access > > > > > that CFQ does, you'd hopefully have both up and running in 2s. > > > > > > > > > > So latency is good, definitely, but sometimes you have to worry > > > > > about the bigger picture too. 
Latency is more than single IOs, > > > > > it's often for complete operation which may involve lots of IOs. > > > > > Single IO latency is a benchmark thing, it's not a real life > > > > > issue. And that's where it becomes complex and not so black and > > > > > white. Mike's test is a really good example of that. > > > > > > > > To the extent of you arguing that Mike's test is artificial (i'm not > > > > sure you are arguing that) - Mike certainly did not do an artificial > > > > test - he tested 'konsole' cache-cold startup latency, such as: > > > > > > [snip] > > > > > > I was saying the exact opposite, that Mike's test is a good example of > > > a valid test. It's not measuring single IO latencies, it's doing a > > > sequence of valid events and looking at the latency for those. It's > > > benchmarking the bigger picture, not a microbenchmark. > > > > Good, so we are in violent agreement :-) > > Yes, perhaps that last sentence didn't provide enough evidence of > which category I put Mike's test into :-) > > So to kick things off, I added an 'interactive' knob to CFQ and > defaulted it to on, along with re-enabling slice idling for hardware > that does tagged command queuing. This is almost completely identical > to what Vivek Goyal originally posted, it's just combined into one and > uses the term 'interactive' instead of 'fairness'. I think the former > is a better umbrella under which to add further tweaks that may > sacrifice throughput slightly, in the quest for better latency. > > It's queued up in the for-linus branch. i'd say 'latency' describes it even better. 'interactivity' as a term is a bit overladen. Ingo ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <20091002173732.GK31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 17:56 ` Ingo Molnar @ 2009-10-02 18:13 ` Mike Galbraith 1 sibling, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-02 18:13 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds On Fri, 2009-10-02 at 19:37 +0200, Jens Axboe wrote: > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > > > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > > > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > > > > It's not _that_ easy, it depends a lot on the access patterns. A > > > > > good example of that is actually the idling that we already do. > > > > > Say you have two applications, each starting up. If you start them > > > > > both at the same time and just care for the dumb low latency, then > > > > > you'll do one IO from each of them in turn. Latency will be good, > > > > > but throughput will be aweful. And this means that in 20s they are > > > > > both started, while with the slice idling and priority disk access > > > > > that CFQ does, you'd hopefully have both up and running in 2s. > > > > > > > > > > So latency is good, definitely, but sometimes you have to worry > > > > > about the bigger picture too. 
Latency is more than single IOs, > > > > > it's often for complete operation which may involve lots of IOs. > > > > > Single IO latency is a benchmark thing, it's not a real life > > > > > issue. And that's where it becomes complex and not so black and > > > > > white. Mike's test is a really good example of that. > > > > > > > > To the extent of you arguing that Mike's test is artificial (i'm not > > > > sure you are arguing that) - Mike certainly did not do an artificial > > > > test - he tested 'konsole' cache-cold startup latency, such as: > > > > > > [snip] > > > > > > I was saying the exact opposite, that Mike's test is a good example of > > > a valid test. It's not measuring single IO latencies, it's doing a > > > sequence of valid events and looking at the latency for those. It's > > > benchmarking the bigger picture, not a microbenchmark. > > > > Good, so we are in violent agreement :-) > > Yes, perhaps that last sentence didn't provide enough evidence of which > category I put Mike's test into :-) > > So to kick things off, I added an 'interactive' knob to CFQ and > defaulted it to on, along with re-enabling slice idling for hardware > that does tagged command queuing. This is almost completely identical to > what Vivek Goyal originally posted, it's just combined into one and uses > the term 'interactive' instead of 'fairness'. I think the former is a > better umbrella under which to add further tweaks that may sacrifice > throughput slightly, in the quest for better latency. > > It's queued up in the for-linus branch. FWIW, I did a matrix of Vivek's patch combined with my hack. Seems we do lose a bit of dd throughput over stock with either or both. 
dd pre        65.1    65.4    67.5    64.8    65.1    65.5    fairness=1 overload_delay=1
perf stat     1.70    1.94    1.32    1.89    1.87     1.7
dd post       69.4    62.3    69.7    70.3    69.6    68.2

dd pre        67.0    67.8    64.7    64.7    64.9    65.8    fairness=1 overload_delay=0
perf stat     4.89    3.13    2.98    2.71    2.17     3.1
dd post       67.2    63.3    62.6    62.8    63.1    63.8

dd pre        65.0    66.0    66.9    64.6    67.0    65.9    fairness=0 overload_delay=1
perf stat     4.66    3.81    4.23    2.98    4.23     3.9
dd post       62.0    60.8    62.4    61.4    62.2    61.7

dd pre        65.3    65.6    64.9    69.5    65.8    66.2    fairness=0 overload_delay=0
perf stat    14.79    9.11   14.16    8.44   13.67    12.0
dd post       64.1    66.5    64.0    66.5    64.4    65.1

^ permalink raw reply [flat|nested] 349+ messages in thread
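For readers who want to re-check Mike's matrix, a short Python sketch follows. It assumes that each row holds five runs followed by their average, and that the dd rows are throughput and the perf stat rows are elapsed konsole startup time (the message does not state units). The names `RUNS`, `REPORTED` and `summarize` are hypothetical; the numbers are copied from the message above.

```python
from statistics import mean

# Keys are (fairness, overload_delay); values are the five runs of each row.
RUNS = {
    (1, 1): {"dd pre":    [65.1, 65.4, 67.5, 64.8, 65.1],
             "perf stat": [1.70, 1.94, 1.32, 1.89, 1.87],
             "dd post":   [69.4, 62.3, 69.7, 70.3, 69.6]},
    (1, 0): {"dd pre":    [67.0, 67.8, 64.7, 64.7, 64.9],
             "perf stat": [4.89, 3.13, 2.98, 2.71, 2.17],
             "dd post":   [67.2, 63.3, 62.6, 62.8, 63.1]},
    (0, 1): {"dd pre":    [65.0, 66.0, 66.9, 64.6, 67.0],
             "perf stat": [4.66, 3.81, 4.23, 2.98, 4.23],
             "dd post":   [62.0, 60.8, 62.4, 61.4, 62.2]},
    (0, 0): {"dd pre":    [65.3, 65.6, 64.9, 69.5, 65.8],
             "perf stat": [14.79, 9.11, 14.16, 8.44, 13.67],
             "dd post":   [64.1, 66.5, 64.0, 66.5, 64.4]},
}

# Sixth column of each row as posted (assumed to be the average of the runs).
REPORTED = {
    (1, 1): {"dd pre": 65.5, "perf stat": 1.7,  "dd post": 68.2},
    (1, 0): {"dd pre": 65.8, "perf stat": 3.1,  "dd post": 63.8},
    (0, 1): {"dd pre": 65.9, "perf stat": 3.9,  "dd post": 61.7},
    (0, 0): {"dd pre": 66.2, "perf stat": 12.0, "dd post": 65.1},
}


def summarize(runs):
    """Return the arithmetic mean of every row for every configuration."""
    return {cfg: {row: mean(vals) for row, vals in rows.items()}
            for cfg, rows in runs.items()}


if __name__ == "__main__":
    summary = summarize(RUNS)
    for (fairness, delay), rows in summary.items():
        cells = "  ".join(f"{row}={value:.2f}" for row, value in rows.items())
        print(f"fairness={fairness} overload_delay={delay}: {cells}")
    stock = summary[(0, 0)]["perf stat"]
    both = summary[(1, 1)]["perf stat"]
    print(f"perf stat, stock vs both knobs: {stock / both:.1f}x")
```

Under these assumptions the recomputed means agree with the posted sixth column to within rounding, and the perf stat average with both knobs enabled is close to a factor of seven lower than with both disabled.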
[parent not found: <20091002175629.GA14860-X9Un+BFzKDI@public.gmane.org>]
* Re: IO scheduler based IO controller V10
From: Jens Axboe @ 2009-10-02 18:04 UTC
To: Ingo Molnar
Cc: Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, Oct 02 2009, Ingo Molnar wrote:
>
> * Jens Axboe <jens.axboe@oracle.com> wrote:
>
> > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > >
> > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > >
> > > > On Fri, Oct 02 2009, Ingo Molnar wrote:
> > > > >
> > > > > * Jens Axboe <jens.axboe@oracle.com> wrote:
> > > > >
> > > > > > It's not _that_ easy, it depends a lot on the access patterns. A
> > > > > > good example of that is actually the idling that we already do.
> > > > > > Say you have two applications, each starting up. If you start them
> > > > > > both at the same time and just care for the dumb low latency, then
> > > > > > you'll do one IO from each of them in turn. Latency will be good,
> > > > > > but throughput will be awful. And this means that in 20s they are
> > > > > > both started, while with the slice idling and priority disk access
> > > > > > that CFQ does, you'd hopefully have both up and running in 2s.
> > > > > >
> > > > > > So latency is good, definitely, but sometimes you have to worry
> > > > > > about the bigger picture too. Latency is more than single IOs,
> > > > > > it's often for complete operation which may involve lots of IOs.
> > > > > > Single IO latency is a benchmark thing, it's not a real life
> > > > > > issue. And that's where it becomes complex and not so black and
> > > > > > white. Mike's test is a really good example of that.
> > > > >
> > > > > To the extent of you arguing that Mike's test is artificial (i'm not
> > > > > sure you are arguing that) - Mike certainly did not do an artificial
> > > > > test - he tested 'konsole' cache-cold startup latency, such as:
> > > >
> > > > [snip]
> > > >
> > > > I was saying the exact opposite, that Mike's test is a good example of
> > > > a valid test. It's not measuring single IO latencies, it's doing a
> > > > sequence of valid events and looking at the latency for those. It's
> > > > benchmarking the bigger picture, not a microbenchmark.
> > >
> > > Good, so we are in violent agreement :-)
> >
> > Yes, perhaps that last sentence didn't provide enough evidence of
> > which category I put Mike's test into :-)
> >
> > So to kick things off, I added an 'interactive' knob to CFQ and
> > defaulted it to on, along with re-enabling slice idling for hardware
> > that does tagged command queuing. This is almost completely identical
> > to what Vivek Goyal originally posted, it's just combined into one and
> > uses the term 'interactive' instead of 'fairness'. I think the former
> > is a better umbrella under which to add further tweaks that may
> > sacrifice throughput slightly, in the quest for better latency.
> >
> > It's queued up in the for-linus branch.
>
> i'd say 'latency' describes it even better. 'interactivity' as a term is
> a bit overladen.

I'm not too crazy about it either. How about just using 'desktop' since
this is obviously what we are really targeting? 'latency' isn't fully
descriptive either, since it may not necessarily provide the best single
IO latency (noop would).

--
Jens Axboe
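Jens's two-applications example (one IO from each in turn versus letting each application run a slice) can be pictured with a toy model. This is only a sketch under stated assumptions: `startup_times` is a hypothetical helper, not kernel code, and the 8 ms seek and 0.1 ms transfer costs are made-up illustrative numbers, not measurements from this thread.

```python
def startup_times(n_ios, transfer_us, seek_us, batch):
    """Toy model: two apps each need n_ios sequential reads from their own
    region of a rotating disk. The scheduler serves `batch` requests from one
    app before switching to the other, and every switch costs one seek.
    Returns [finish_time_app0, finish_time_app1] in microseconds."""
    remaining = [n_ios, n_ios]
    finished = [None, None]
    clock = 0
    current = 0
    last = None  # app whose disk region the head is currently in
    while remaining[0] or remaining[1]:
        if remaining[current] == 0:
            current = 1 - current  # this app is done, keep serving the other
        if last != current:
            clock += seek_us       # head has to move to this app's region
            last = current
        served = min(batch, remaining[current])
        clock += served * transfer_us
        remaining[current] -= served
        if remaining[current] == 0:
            finished[current] = clock
        current = 1 - current
    return finished


if __name__ == "__main__":
    # Hypothetical numbers: 1000 reads per app, 0.1 ms per read, 8 ms per seek.
    for batch in (1, 1000):
        a, b = startup_times(1000, 100, 8000, batch)
        print(f"batch={batch:4d}: app0 up at {a / 1000:.1f} ms, "
              f"app1 up at {b / 1000:.1f} ms")
```

With these made-up parameters, strict alternation brings both applications up after about 16.2 s, while serving each one in a single batch brings them up after about 0.1 s and 0.2 s. The model ignores idling timers, queue depth and everything else CFQ really does; it only shows why per-IO fairness can hurt the latency of a whole operation.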
* Re: IO scheduler based IO controller V10
From: Mike Galbraith @ 2009-10-02 18:22 UTC
To: Jens Axboe
Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, 2009-10-02 at 20:04 +0200, Jens Axboe wrote:

> I'm not too crazy about it either. How about just using 'desktop' since
> this is obviously what we are really targeting? 'latency' isn't fully
> descriptive either, since it may not necessarily provide the best single
> IO latency (noop would).

Grin. "Perfect is the enemy of good" :)

                                          Avg
 16.24  175.82  154.38  228.97  147.16  144.5  noop
 43.23   57.39   96.13  148.25  180.09  105.0  deadline
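The Avg column in Mike's noop/deadline figures can be re-derived from the five runs. The sketch below does that and also shows the spread, which is what makes the noop row stand out. The message gives no units, and `summarize` is just a helper name chosen for this note.

```python
from statistics import mean

# Figures copied from the message above; the message does not state units.
runs = {
    "noop":     [16.24, 175.82, 154.38, 228.97, 147.16],
    "deadline": [43.23, 57.39, 96.13, 148.25, 180.09],
}


def summarize(samples):
    """Return (mean, min, max) for one scheduler's runs."""
    return mean(samples), min(samples), max(samples)


if __name__ == "__main__":
    for name, samples in runs.items():
        avg, lo, hi = summarize(samples)
        print(f"{name:9s} avg={avg:6.1f} min={lo:6.2f} max={hi:6.2f}")
```

The computed means agree with the posted Avg values (144.5 for noop, 105.0 for deadline). The best single noop run (16.24) is the lowest figure in the table, yet its average is the worse of the two.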
* Re: IO scheduler based IO controller V10
From: Jens Axboe @ 2009-10-02 18:26 UTC
To: Mike Galbraith
Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, Oct 02 2009, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 20:04 +0200, Jens Axboe wrote:
>
> > I'm not too crazy about it either. How about just using 'desktop' since
> > this is obviously what we are really targeting? 'latency' isn't fully
> > descriptive either, since it may not necessarily provide the best single
> > IO latency (noop would).
>
> Grin. "Perfect is the enemy of good" :)
>                                           Avg
>  16.24  175.82  154.38  228.97  147.16  144.5  noop
>  43.23   57.39   96.13  148.25  180.09  105.0  deadline

Yep, that's where it falls down. Noop basically fails here because it
treats all IO as equal, which obviously isn't true for most people. But
even for pure read workloads (is the above the mixed read/write, or just
read?), latency would be excellent with noop but the desktop experience
would not.

--
Jens Axboe
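Jens's remark that noop "treats all IO as equal" can be pictured with a toy dispatch model: a single read queued behind a burst of writes waits for all of them under plain arrival order, but not if reads are dispatched first. Everything here is hypothetical (`finish_time`, `example_queue`, the request tuples, the 10 ms service cost). The real noop elevator also merges requests, and this is not how deadline or CFQ are implemented.

```python
def finish_time(queue, target, reads_first=False):
    """Return the time (ms) at which the request named `target` completes.

    queue is a list of (name, kind, cost_ms) tuples in arrival order.
    With reads_first=False requests are served strictly in arrival order;
    with reads_first=True all reads are served before any write, keeping
    arrival order within each kind (sorted() is stable)."""
    if reads_first:
        order = sorted(queue, key=lambda req: req[1] != "read")
    else:
        order = list(queue)
    clock = 0
    for name, _kind, cost_ms in order:
        clock += cost_ms
        if name == target:
            return clock
    raise ValueError(f"no request named {target!r} in queue")


def example_queue():
    """Hypothetical workload: 50 queued writes, then one interactive read."""
    writes = [(f"dd-{i}", "write", 10) for i in range(50)]
    return writes + [("konsole", "read", 10)]


if __name__ == "__main__":
    q = example_queue()
    print("arrival order:", finish_time(q, "konsole"), "ms")
    print("reads first:  ", finish_time(q, "konsole", reads_first=True), "ms")
```

In this made-up workload the read completes after 510 ms in arrival order and after 10 ms when reads go first. The point is only that the same per-request service time gives very different perceived latency depending on which requests the scheduler treats as more urgent.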
* Re: IO scheduler based IO controller V10
From: Mike Galbraith @ 2009-10-02 18:33 UTC
To: Jens Axboe
Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, 2009-10-02 at 20:26 +0200, Jens Axboe wrote:
> On Fri, Oct 02 2009, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:04 +0200, Jens Axboe wrote:
> >
> > > I'm not too crazy about it either. How about just using 'desktop' since
> > > this is obviously what we are really targeting? 'latency' isn't fully
> > > descriptive either, since it may not necessarily provide the best single
> > > IO latency (noop would).
> >
> > Grin. "Perfect is the enemy of good" :)
> >                                           Avg
> >  16.24  175.82  154.38  228.97  147.16  144.5  noop
> >  43.23   57.39   96.13  148.25  180.09  105.0  deadline
>
> Yep, that's where it falls down. Noop basically fails here because it
> treats all IO as equal, which obviously isn't true for most people. But
> even for pure read workloads (is the above the mixed read/write, or just
> read?), latency would be excellent with noop but the desktop experience
> would not.

Yeah, it's the dd vs konsole -e exit.

	-Mike
* Re: IO scheduler based IO controller V10
From: Theodore Tso @ 2009-10-02 18:36 UTC
To: Jens Axboe
Cc: Ingo Molnar, Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > i'd say 'latency' describes it even better. 'interactivity' as a term is
> > a bit overladen.
>
> I'm not too crazy about it either. How about just using 'desktop' since
> this is obviously what we are really targeting? 'latency' isn't fully
> descriptive either, since it may not necessarily provide the best single
> IO latency (noop would).

As Linus has already pointed out, it's not necessarily "desktop"
versus "server". There will be certain high frequency transaction
database workloads (for example) that will very much care about
latency. I think "low_latency" may be the best term to use.

						- Ted
* Re: IO scheduler based IO controller V10
From: Jens Axboe @ 2009-10-02 18:45 UTC
To: Theodore Tso
Cc: Ingo Molnar, Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, Oct 02 2009, Theodore Tso wrote:
> On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > > i'd say 'latency' describes it even better. 'interactivity' as a term is
> > > a bit overladen.
> >
> > I'm not too crazy about it either. How about just using 'desktop' since
> > this is obviously what we are really targeting? 'latency' isn't fully
> > descriptive either, since it may not necessarily provide the best single
> > IO latency (noop would).
>
> As Linus has already pointed out, it's not necessarily "desktop"
> versus "server". There will be certain high frequency transaction
> database workloads (for example) that will very much care about
> latency. I think "low_latency" may be the best term to use.

Not necessarily, but typically it will be. As already noted, I don't
think latency itself is a very descriptive term for this.

--
Jens Axboe
* Re: IO scheduler based IO controller V10
From: Ingo Molnar @ 2009-10-02 19:01 UTC
To: Jens Axboe
Cc: Theodore Tso, Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

* Jens Axboe <jens.axboe@oracle.com> wrote:

> On Fri, Oct 02 2009, Theodore Tso wrote:
> > On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > > > i'd say 'latency' describes it even better. 'interactivity' as a term is
> > > > a bit overladen.
> > >
> > > I'm not too crazy about it either. How about just using 'desktop'
> > > since this is obviously what we are really targeting? 'latency'
> > > isn't fully descriptive either, since it may not necessarily
> > > provide the best single IO latency (noop would).
> >
> > As Linus has already pointed out, it's not necessarily "desktop"
> > versus "server". There will be certain high frequency transaction
> > database workloads (for example) that will very much care about
> > latency. I think "low_latency" may be the best term to use.
>
> Not necessarily, but typically it will be. As already noted, I don't
> think latency itself is a very descriptive term for this.

Why not? Nobody will think of 'latency' as something that requires noop,
but as something that in practice achieves low latencies, for stuff that
people use.

	Ingo
* Re: IO scheduler based IO controller V10
From: Jens Axboe @ 2009-10-02 19:09 UTC
To: Ingo Molnar
Cc: Theodore Tso, Linus Torvalds, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, Oct 02 2009, Ingo Molnar wrote:
>
> * Jens Axboe <jens.axboe@oracle.com> wrote:
>
> > On Fri, Oct 02 2009, Theodore Tso wrote:
> > > On Fri, Oct 02, 2009 at 08:04:37PM +0200, Jens Axboe wrote:
> > > > > i'd say 'latency' describes it even better. 'interactivity' as a term is
> > > > > a bit overladen.
> > > >
> > > > I'm not too crazy about it either. How about just using 'desktop'
> > > > since this is obviously what we are really targeting? 'latency'
> > > > isn't fully descriptive either, since it may not necessarily
> > > > provide the best single IO latency (noop would).
> > >
> > > As Linus has already pointed out, it's not necessarily "desktop"
> > > versus "server". There will be certain high frequency transaction
> > > database workloads (for example) that will very much care about
> > > latency. I think "low_latency" may be the best term to use.
> >
> > Not necessarily, but typically it will be. As already noted, I don't
> > think latency itself is a very descriptive term for this.
>
> Why not? Nobody will think of 'latency' as something that requires noop,
> but as something that in practice achieves low latencies, for stuff that
> people use.

Alright, I'll acknowledge that if that's the general consensus. I may be
somewhat biased myself.

--
Jens Axboe
* Re: IO scheduler based IO controller V10 2009-10-02 17:37 ` Jens Axboe [not found] ` <20091002173732.GK31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 17:56 ` Ingo Molnar @ 2009-10-02 18:13 ` Mike Galbraith 2009-10-02 18:19 ` Jens Axboe [not found] ` <1254507215.8667.7.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2 siblings, 2 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-02 18:13 UTC (permalink / raw) To: Jens Axboe Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, 2009-10-02 at 19:37 +0200, Jens Axboe wrote: > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > * Jens Axboe <jens.axboe@oracle.com> wrote: > > > > > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > > > > > * Jens Axboe <jens.axboe@oracle.com> wrote: > > > > > > > > > It's not _that_ easy, it depends a lot on the access patterns. A > > > > > good example of that is actually the idling that we already do. > > > > > Say you have two applications, each starting up. If you start them > > > > > both at the same time and just care for the dumb low latency, then > > > > > you'll do one IO from each of them in turn. Latency will be good, > > > > > but throughput will be aweful. And this means that in 20s they are > > > > > both started, while with the slice idling and priority disk access > > > > > that CFQ does, you'd hopefully have both up and running in 2s. > > > > > > > > > > So latency is good, definitely, but sometimes you have to worry > > > > > about the bigger picture too. Latency is more than single IOs, > > > > > it's often for complete operation which may involve lots of IOs. > > > > > Single IO latency is a benchmark thing, it's not a real life > > > > > issue. And that's where it becomes complex and not so black and > > > > > white. 
Mike's test is a really good example of that. > > > > > > > > To the extent of you arguing that Mike's test is artificial (i'm not > > > > sure you are arguing that) - Mike certainly did not do an artificial > > > > test - he tested 'konsole' cache-cold startup latency, such as: > > > > > > [snip] > > > > > > I was saying the exact opposite, that Mike's test is a good example of > > > a valid test. It's not measuring single IO latencies, it's doing a > > > sequence of valid events and looking at the latency for those. It's > > > benchmarking the bigger picture, not a microbenchmark. > > > > Good, so we are in violent agreement :-) > > Yes, perhaps that last sentence didn't provide enough evidence of which > category I put Mike's test into :-) > > So to kick things off, I added an 'interactive' knob to CFQ and > defaulted it to on, along with re-enabling slice idling for hardware > that does tagged command queuing. This is almost completely identical to > what Vivek Goyal originally posted, it's just combined into one and uses > the term 'interactive' instead of 'fairness'. I think the former is a > better umbrella under which to add further tweaks that may sacrifice > throughput slightly, in the quest for better latency. > > It's queued up in the for-linus branch. FWIW, I did a matrix of Vivek's patch combined with my hack. Seems we do lose a bit of dd throughput over stock with either or both. 

                                                         Avg
dd pre          65.1    65.4    67.5    64.8    65.1    65.5   fairness=1 overload_delay=1
perf stat       1.70    1.94    1.32    1.89    1.87     1.7
dd post         69.4    62.3    69.7    70.3    69.6    68.2

dd pre          67.0    67.8    64.7    64.7    64.9    65.8   fairness=1 overload_delay=0
perf stat       4.89    3.13    2.98    2.71    2.17     3.1
dd post         67.2    63.3    62.6    62.8    63.1    63.8

dd pre          65.0    66.0    66.9    64.6    67.0    65.9   fairness=0 overload_delay=1
perf stat       4.66    3.81    4.23    2.98    4.23     3.9
dd post         62.0    60.8    62.4    61.4    62.2    61.7

dd pre          65.3    65.6    64.9    69.5    65.8    66.2   fairness=0 overload_delay=0
perf stat      14.79    9.11   14.16    8.44   13.67    12.0
dd post         64.1    66.5    64.0    66.5    64.4    65.1

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 18:13 ` Mike Galbraith @ 2009-10-02 18:19 ` Jens Axboe 2009-10-02 18:57 ` Mike Galbraith ` (2 more replies) [not found] ` <1254507215.8667.7.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 1 sibling, 3 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 18:19 UTC (permalink / raw) To: Mike Galbraith Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, Oct 02 2009, Mike Galbraith wrote: > On Fri, 2009-10-02 at 19:37 +0200, Jens Axboe wrote: > > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > > > * Jens Axboe <jens.axboe@oracle.com> wrote: > > > > > > > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > > > > > > > * Jens Axboe <jens.axboe@oracle.com> wrote: > > > > > > > > > > > It's not _that_ easy, it depends a lot on the access patterns. A > > > > > > good example of that is actually the idling that we already do. > > > > > > Say you have two applications, each starting up. If you start them > > > > > > both at the same time and just care for the dumb low latency, then > > > > > > you'll do one IO from each of them in turn. Latency will be good, > > > > > > but throughput will be aweful. And this means that in 20s they are > > > > > > both started, while with the slice idling and priority disk access > > > > > > that CFQ does, you'd hopefully have both up and running in 2s. > > > > > > > > > > > > So latency is good, definitely, but sometimes you have to worry > > > > > > about the bigger picture too. Latency is more than single IOs, > > > > > > it's often for complete operation which may involve lots of IOs. > > > > > > Single IO latency is a benchmark thing, it's not a real life > > > > > > issue. And that's where it becomes complex and not so black and > > > > > > white. 
Mike's test is a really good example of that. > > > > > > > > > > To the extent of you arguing that Mike's test is artificial (i'm not > > > > > sure you are arguing that) - Mike certainly did not do an artificial > > > > > test - he tested 'konsole' cache-cold startup latency, such as: > > > > > > > > [snip] > > > > > > > > I was saying the exact opposite, that Mike's test is a good example of > > > > a valid test. It's not measuring single IO latencies, it's doing a > > > > sequence of valid events and looking at the latency for those. It's > > > > benchmarking the bigger picture, not a microbenchmark. > > > > > > Good, so we are in violent agreement :-) > > > > Yes, perhaps that last sentence didn't provide enough evidence of which > > category I put Mike's test into :-) > > > > So to kick things off, I added an 'interactive' knob to CFQ and > > defaulted it to on, along with re-enabling slice idling for hardware > > that does tagged command queuing. This is almost completely identical to > > what Vivek Goyal originally posted, it's just combined into one and uses > > the term 'interactive' instead of 'fairness'. I think the former is a > > better umbrella under which to add further tweaks that may sacrifice > > throughput slightly, in the quest for better latency. > > > > It's queued up in the for-linus branch. > > FWIW, I did a matrix of Vivek's patch combined with my hack. Seems we > do lose a bit of dd throughput over stock with either or both. 
>
>                                                          Avg
> dd pre          65.1    65.4    67.5    64.8    65.1    65.5   fairness=1 overload_delay=1
> perf stat       1.70    1.94    1.32    1.89    1.87     1.7
> dd post         69.4    62.3    69.7    70.3    69.6    68.2
>
> dd pre          67.0    67.8    64.7    64.7    64.9    65.8   fairness=1 overload_delay=0
> perf stat       4.89    3.13    2.98    2.71    2.17     3.1
> dd post         67.2    63.3    62.6    62.8    63.1    63.8
>
> dd pre          65.0    66.0    66.9    64.6    67.0    65.9   fairness=0 overload_delay=1
> perf stat       4.66    3.81    4.23    2.98    4.23     3.9
> dd post         62.0    60.8    62.4    61.4    62.2    61.7
>
> dd pre          65.3    65.6    64.9    69.5    65.8    66.2   fairness=0 overload_delay=0
> perf stat      14.79    9.11   14.16    8.44   13.67    12.0
> dd post         64.1    66.5    64.0    66.5    64.4    65.1

I'm not too worried about the "single IO producer" scenarios, and it
looks like (from a quick look) that most of your numbers are within some
expected noise levels. It's the more complex mixes that are likely to
cause a bit of a stink, but lets worry about that later. One quick thing
would be to read eg 2 or more files sequentially from disk and see how
that performs.

If you could do a cleaned up version of your overload patch based on
this:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768

then lets take it from there.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 18:19 ` Jens Axboe @ 2009-10-02 18:57 ` Mike Galbraith 2009-10-02 20:47 ` Mike Galbraith [not found] ` <1254509838.8667.30.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-03 5:48 ` Mike Galbraith [not found] ` <20091002181903.GN31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2 siblings, 2 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-02 18:57 UTC (permalink / raw) To: Jens Axboe Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote: > I'm not too worried about the "single IO producer" scenarios, and it > looks like (from a quick look) that most of your numbers are within some > expected noise levels. It's the more complex mixes that are likely to > cause a bit of a stink, but lets worry about that later. One quick thing > would be to read eg 2 or more files sequentially from disk and see how > that performs. Hm. git(s) should be good for a nice repeatable load. Suggestions? > If you could do a cleaned up version of your overload patch based on > this: > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768 > > then lets take it from there. I'll try to find a good repeatable git beater first. At this point, I only know it helps with one load. -Mike ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 18:57 ` Mike Galbraith @ 2009-10-02 20:47 ` Mike Galbraith [not found] ` <1254509838.8667.30.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 1 sibling, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-02 20:47 UTC (permalink / raw) To: Jens Axboe Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, 2009-10-02 at 20:57 +0200, Mike Galbraith wrote: > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote: > > > I'm not too worried about the "single IO producer" scenarios, and it > > looks like (from a quick look) that most of your numbers are within some > > expected noise levels. It's the more complex mixes that are likely to > > cause a bit of a stink, but lets worry about that later. One quick thing > > would be to read eg 2 or more files sequentially from disk and see how > > that performs. > > Hm. git(s) should be good for a nice repeatable load. Suggestions? > > > If you could do a cleaned up version of your overload patch based on > > this: > > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768 > > > > then lets take it from there. > > I'll try to find a good repeatable git beater first. At this point, I > only know it helps with one load. Seems to help mixed concurrent read/write a bit too. 

perf stat testo.sh                                 Avg
  108.12  106.33  106.34   97.00  106.52    104.8   1.000   fairness=0 overload_delay=0
   93.98  102.44   94.47   97.70   98.90     97.4    .929   fairness=0 overload_delay=1
   90.87   95.40   95.79   93.09   94.25     93.8    .895   fairness=1 overload_delay=0
   89.93   90.57   89.13   93.43   93.72     91.3    .871   fairness=1 overload_delay=1

#!/bin/sh

LOGFILE=testo.log
rm -f $LOGFILE

echo 3 > /proc/sys/vm/drop_caches
sh -c "(cd linux-2.6.23; perf stat -- git checkout -f; git archive --format=tar HEAD > ../linux-2.6.23.tar)" 2>&1|tee -a $LOGFILE &
sh -c "(cd linux-2.6.24; perf stat -- git archive --format=tar HEAD > ../linux-2.6.24.tar; git checkout -f)" 2>&1|tee -a $LOGFILE &
sh -c "(cd linux-2.6.25; perf stat -- git checkout -f; git archive --format=tar HEAD > ../linux-2.6.25.tar)" 2>&1|tee -a $LOGFILE &
sh -c "(cd linux-2.6.26; perf stat -- git archive --format=tar HEAD > ../linux-2.6.26.tar; git checkout -f)" 2>&1|tee -a $LOGFILE &
wait

^ permalink raw reply [flat|nested] 349+ messages in thread
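For readers checking the table, the Avg column is consistent with the mean of the five runs, and the ratio column with each average divided by the fairness=0 overload_delay=0 average. The sketch below only illustrates that arithmetic; the function names are hypothetical and it is not taken from the thread. The printed averages look truncated rather than rounded (the baseline mean works out to 104.862, shown as 104.8).

```c
#include <stddef.h>

/* Arithmetic mean of n samples; n must be non-zero. */
static double mean(const double *samples, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += samples[i];
	return sum / (double)n;
}

/* Result relative to a baseline; below 1.0 means less than the baseline. */
static double relative_to(double value, double baseline)
{
	return value / baseline;
}

/* Hypothetical helper: mean of the five baseline runs from the table. */
static double baseline_mean(void)
{
	static const double runs[] = { 108.12, 106.33, 106.34, 97.00, 106.52 };

	return mean(runs, sizeof(runs) / sizeof(runs[0]));
}
```

With the table's own averages, `relative_to(97.4, 104.8)` comes out at roughly 0.929, matching the second row.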
* Re: IO scheduler based IO controller V10 2009-10-02 18:19 ` Jens Axboe 2009-10-02 18:57 ` Mike Galbraith @ 2009-10-03 5:48 ` Mike Galbraith 2009-10-03 5:56 ` Mike Galbraith ` (2 more replies) [not found] ` <20091002181903.GN31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2 siblings, 3 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-03 5:48 UTC (permalink / raw)
To: Jens Axboe
Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:

> If you could do a cleaned up version of your overload patch based on
> this:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
>
> then lets take it from there.

If take it from there ends up meaning apply, and see who squeaks, feel
free to delete the "Not", and my somewhat defective sense of humor.

Block:  Delay overloading of CFQ queues to improve read latency.

Introduce a delay maximum dispatch timestamp, and stamp it when:
        1. we encounter a known seeky or possibly new sync IO queue.
        2. the current queue may go idle and we're draining async IO.
        3. we have sync IO in flight and are servicing an async queue.
        4  we are not the sole user of disk.
Disallow exceeding quantum if any of these events have occurred recently.

Protect this behavioral change with a "desktop_dispatch" knob and default
it to "on".. providing an easy means of regression verification prior to
hate-mail dispatch :) to CC list.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
... others who let somewhat hacky tweak slip by

LKML-Reference: <new-submission>

---
 block/cfq-iosched.c |   45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -174,6 +174,9 @@ struct cfq_data {
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 	unsigned int cfq_desktop;
+	unsigned int cfq_desktop_dispatch;
+
+	unsigned long desktop_dispatch_ts;
 
 	struct list_head cic_list;
 
@@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
+	unsigned long delay;
 
 	if (!cfqd->busy_queues)
 		return 0;
@@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
 	/*
 	 * Drain async requests before we start sync IO
 	 */
-	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
-	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	max_dispatch = cfqd->cfq_quantum;
 	if (cfq_class_idle(cfqq))
 		max_dispatch = 1;
 
+	if (cfqd->busy_queues > 1)
+		cfqd->desktop_dispatch_ts = jiffies;
+
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
@@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
 		return 0;
 
 	/*
+	 * Don't start overloading until we've been alone for a bit.
+	 */
+	if (cfqd->cfq_desktop_dispatch) {
+		delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
+
+		if (time_before(jiffies, max_delay))
+			return 0;
+	}
+
+	/*
 	 * we are the only queue, allow up to 4 times of 'quantum'
 	 */
 	if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1942,7 +1963,7 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct cfq_io_context *cic)
 {
-	int old_idle, enable_idle;
+	int old_idle, enable_idle, seeky = 0;
 
 	/*
 	 * Don't idle for async or idle io prio class
@@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
+	if (cfqd->hw_tag) {
+		if (CIC_SEEKY(cic))
+			seeky = 1;
+		/*
+		 * If seeky or incalculable seekiness, delay overloading.
+		 */
+		if (seeky || !sample_valid(cic->seek_samples))
+			cfqd->desktop_dispatch_ts = jiffies;
+	}
+
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
+	    (!cfqd->cfq_desktop && seeky))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->cfq_desktop = 1;
+	cfqd->cfq_desktop_dispatch = 1;
+
+	cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
+SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
+STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
 	CFQ_ATTR(desktop),
+	CFQ_ATTR(desktop_dispatch),
 	__ATTR_NULL
 };
 

^ permalink raw reply [flat|nested] 349+ messages in thread
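The gating idea in the patch description (stamp a timestamp whenever one of the listed events occurs, and refuse to exceed the quantum until a quiet period has passed) can be modelled outside the kernel roughly as follows. This is a user-space sketch under stated assumptions, not the patch: `ticks_t`, `struct overload_gate`, `gate_stamp()` and `gate_may_overload()` are made-up names, and `ticks_before()` only mirrors the wraparound-tolerant comparison that the kernel's `time_before()` performs on jiffies.

```c
#include <stdbool.h>

/* Stand-in for the kernel's jiffies counter type (assumption). */
typedef unsigned long ticks_t;

/*
 * True if a is earlier than b, tolerating counter wraparound.
 * Relies on the usual two's-complement conversion to signed long.
 */
static bool ticks_before(ticks_t a, ticks_t b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical model of the "delay overloading" state. */
struct overload_gate {
	ticks_t last_event; /* last time a disqualifying event was seen */
	ticks_t quiet;      /* quiet period required before overloading */
};

/* Record that one of the disqualifying events happened at time now. */
static void gate_stamp(struct overload_gate *g, ticks_t now)
{
	g->last_event = now;
}

/* Exceeding the quantum is allowed only once the quiet period has elapsed. */
static bool gate_may_overload(const struct overload_gate *g, ticks_t now)
{
	ticks_t deadline = g->last_event + g->quiet;

	return !ticks_before(now, deadline);
}
```

In this model a caller would invoke `gate_stamp()` at each of the four points the description lists and consult `gate_may_overload()` before allowing more than the quantum in flight.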
* Re: IO scheduler based IO controller V10 2009-10-03 5:48 ` Mike Galbraith @ 2009-10-03 5:56 ` Mike Galbraith 2009-10-03 7:24 ` Jens Axboe ` (2 more replies) [not found] ` <1254548931.8299.18.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-03 7:20 ` Ingo Molnar 2 siblings, 3 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-10-03 5:56 UTC (permalink / raw)
To: Jens Axboe
Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
>
> > If you could do a cleaned up version of your overload patch based on
> > this:
> >
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> >
> > then lets take it from there.

Note to self: build the darn thing after last minute changes.

Block:  Delay overloading of CFQ queues to improve read latency.

Introduce a delay maximum dispatch timestamp, and stamp it when:
        1. we encounter a known seeky or possibly new sync IO queue.
        2. the current queue may go idle and we're draining async IO.
        3. we have sync IO in flight and are servicing an async queue.
        4  we are not the sole user of disk.
Disallow exceeding quantum if any of these events have occurred recently.

Protect this behavioral change with a "desktop_dispatch" knob and default
it to "on".. providing an easy means of regression verification prior to
hate-mail dispatch :) to CC list.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
... others who let somewhat hacky tweak slip by

---
 block/cfq-iosched.c |   45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -174,6 +174,9 @@ struct cfq_data {
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 	unsigned int cfq_desktop;
+	unsigned int cfq_desktop_dispatch;
+
+	unsigned long desktop_dispatch_ts;
 
 	struct list_head cic_list;
 
@@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
+	unsigned long delay;
 
 	if (!cfqd->busy_queues)
 		return 0;
@@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
 	/*
 	 * Drain async requests before we start sync IO
 	 */
-	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
-	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	max_dispatch = cfqd->cfq_quantum;
 	if (cfq_class_idle(cfqq))
 		max_dispatch = 1;
 
+	if (cfqd->busy_queues > 1)
+		cfqd->desktop_dispatch_ts = jiffies;
+
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
@@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
 		return 0;
 
 	/*
+	 * Don't start overloading until we've been alone for a bit.
+	 */
+	if (cfqd->cfq_desktop_dispatch) {
+		delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
+
+		if (time_before(jiffies, max_delay))
+			return 0;
+	}
+
+	/*
 	 * we are the only queue, allow up to 4 times of 'quantum'
 	 */
 	if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1942,7 +1963,7 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct cfq_io_context *cic)
 {
-	int old_idle, enable_idle;
+	int old_idle, enable_idle, seeky = 0;
 
 	/*
 	 * Don't idle for async or idle io prio class
@@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
+	if (cfqd->hw_tag) {
+		if (CIC_SEEKY(cic))
+			seeky = 1;
+		/*
+		 * If seeky or incalculable seekiness, delay overloading.
+		 */
+		if (seeky || !sample_valid(cic->seek_samples))
+			cfqd->desktop_dispatch_ts = jiffies;
+	}
+
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
+	    (!cfqd->cfq_desktop && seeky))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->cfq_desktop = 1;
+	cfqd->cfq_desktop_dispatch = 1;
+
+	cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
+SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
+STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
 	CFQ_ATTR(desktop),
+	CFQ_ATTR(desktop_dispatch),
 	__ATTR_NULL
 };
 

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-03 5:56 ` Mike Galbraith @ 2009-10-03 7:24 ` Jens Axboe 2009-10-03 9:00 ` Mike Galbraith [not found] ` <20091003072401.GV31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-03 11:29 ` Vivek Goyal [not found] ` <1254549378.8299.21.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2 siblings, 2 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-03 7:24 UTC (permalink / raw)
To: Mike Galbraith
Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Sat, Oct 03 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> >
> > > If you could do a cleaned up version of your overload patch based on
> > > this:
> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > >
> > > then lets take it from there.
>
> Note to self: build the darn thing after last minute changes.
>
> Block:  Delay overloading of CFQ queues to improve read latency.
>
> Introduce a delay maximum dispatch timestamp, and stamp it when:
>         1. we encounter a known seeky or possibly new sync IO queue.
>         2. the current queue may go idle and we're draining async IO.
>         3. we have sync IO in flight and are servicing an async queue.
>         4  we are not the sole user of disk.
> Disallow exceeding quantum if any of these events have occurred recently.
>
> Protect this behavioral change with a "desktop_dispatch" knob and default
> it to "on".. providing an easy means of regression verification prior to
> hate-mail dispatch :) to CC list.

It still doesn't build:

block/cfq-iosched.c: In function ‘cfq_dispatch_requests’:
block/cfq-iosched.c:1345: error: ‘max_delay’ undeclared (first use in this function)

After shutting down the computer yesterday, I was thinking a bit about
this issue and how to solve it without incurring too much delay. If we
add a stricter control of the depth, that may help. So instead of
allowing up to max_quantum (or larger) depths, only allow gradual build
up of that the farther we get away from a dispatch from the sync IO
queues. For example, when switching to an async or seeky sync queue,
initially allow just 1 in flight. For the next round, if there still
hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
again, immediately drop to 1.

It could tie in with (or partly replace) the overload feature. The key
to good latency and decent throughput is knowing when to allow queue
build up and when not to.

-- 
Jens Axboe

^ permalink raw reply [flat|nested] 349+ messages in thread
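The gradual build-up described in this message (allow 1 in flight, then 2, then 4, and drop back to 1 as soon as sync IO shows up) can be sketched as a small state machine. This is only an illustration under assumptions: `struct depth_ramp`, `ramp_init()` and `ramp_next_round()` are hypothetical names, a "round" is modelled as one call, and `max_depth` stands in for whatever quantum-derived cap the scheduler would use.

```c
#include <stdbool.h>

/* Hypothetical model of the proposed depth ramp; not CFQ code. */
struct depth_ramp {
	unsigned int allowed;   /* depth the next quiet round may use */
	unsigned int max_depth; /* upper bound on the ramp (assumption) */
};

static void ramp_init(struct depth_ramp *r, unsigned int max_depth)
{
	r->allowed = 1;
	r->max_depth = max_depth ? max_depth : 1;
}

/*
 * Called once per dispatch round. Returns the depth allowed for this
 * round: 1, 2, 4, ... up to max_depth while no sync IO is seen, and 1
 * again as soon as sync IO is seen.
 */
static unsigned int ramp_next_round(struct depth_ramp *r, bool sync_io_seen)
{
	unsigned int depth;

	if (sync_io_seen) {
		r->allowed = 1;
		return 1;
	}

	depth = r->allowed;

	/* Double for the next round, clamping without overflow. */
	if (r->allowed > r->max_depth / 2)
		r->allowed = r->max_depth;
	else
		r->allowed *= 2;

	return depth;
}
```

Doubling keeps the number of rounds needed to reach full depth logarithmic in `max_depth`, which is one way to address the concern about taking many iterations to build up sizable IO; a different growth rule would fit the same interface.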
* Re: IO scheduler based IO controller V10 2009-10-03 7:24 ` Jens Axboe @ 2009-10-03 9:00 ` Mike Galbraith 2009-10-03 9:12 ` Corrado Zoccolo ` (2 more replies) [not found] ` <20091003072401.GV31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 1 sibling, 3 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-03 9:00 UTC (permalink / raw) To: Jens Axboe Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote: > After shutting down the computer yesterday, I was thinking a bit about > this issue and how to solve it without incurring too much delay. If we > add a stricter control of the depth, that may help. So instead of > allowing up to max_quantum (or larger) depths, only allow gradual build > up of that the farther we get away from a dispatch from the sync IO > queues. For example, when switching to an async or seeky sync queue, > initially allow just 1 in flight. For the next round, if there still > hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue > again, immediately drop to 1. > > It could tie in with (or partly replace) the overload feature. The key > to good latency and decent throughput is knowing when to allow queue > build up and when not to. Hm. Starting at 1 sounds a bit thin (like IDLE), multiple iterations to build/unleash any sizable IO, but that's just my gut talking. -Mike ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-10-03  9:00 ` Mike Galbraith
@ 2009-10-03  9:12   ` Corrado Zoccolo
       [not found]   ` <1254560434.17052.14.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2009-10-03 13:17   ` Jens Axboe
  2 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-10-03  9:12 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Jens Axboe, Ingo Molnar, Linus Torvalds, Vivek Goyal,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, riel

Hi,
On Sat, Oct 3, 2009 at 11:00 AM, Mike Galbraith <efault@gmx.de> wrote:
> On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
>
>> After shutting down the computer yesterday, I was thinking a bit about
>> this issue and how to solve it without incurring too much delay. If we
>> add a stricter control of the depth, that may help. So instead of
>> allowing up to max_quantum (or larger) depths, only allow gradual build
>> up of that the farther we get away from a dispatch from the sync IO
>> queues. For example, when switching to an async or seeky sync queue,
>> initially allow just 1 in flight. For the next round, if there still
>> hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
>> again, immediately drop to 1.
>>

I would limit just async I/O. Seeky sync queues are automatically
throttled by being sync, and already have high latency, so we
shouldn't increase it artificially. Instead, I think we should
send multiple seeky requests (possibly coming from different queues)
at once. This should help especially with RAID devices, where the seeks
for requests going to different disks will happen in parallel.

>> It could tie in with (or partly replace) the overload feature. The key
>> to good latency and decent throughput is knowing when to allow queue
>> build up and when not to.
>
> Hm.  Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> build/unleash any sizable IO, but that's just my gut talking.
>
On the other hand, sending 1 write first and then waiting for it to
complete before submitting new ones will allow more merges to happen,
so the subsequent requests will be bigger and thus more efficient.

Corrado

>        -Mike
>

-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
* Re: IO scheduler based IO controller V10
  2009-10-03  9:12 ` Corrado Zoccolo
@ 2009-10-03 13:18   ` Jens Axboe
  0 siblings, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-03 13:18 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Mike Galbraith, Ingo Molnar, Linus Torvalds, Vivek Goyal,
	Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, riel

On Sat, Oct 03 2009, Corrado Zoccolo wrote:
> Hi,
> On Sat, Oct 3, 2009 at 11:00 AM, Mike Galbraith <efault@gmx.de> wrote:
> > On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
> >
> >> After shutting down the computer yesterday, I was thinking a bit about
> >> this issue and how to solve it without incurring too much delay. If we
> >> add a stricter control of the depth, that may help. So instead of
> >> allowing up to max_quantum (or larger) depths, only allow gradual build
> >> up of that the farther we get away from a dispatch from the sync IO
> >> queues. For example, when switching to an async or seeky sync queue,
> >> initially allow just 1 in flight. For the next round, if there still
> >> hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
> >> again, immediately drop to 1.
> >>
>
> I would limit just async I/O. Seeky sync queues are automatically
> throttled by being sync, and have already high latency, so we
> shouldn't increase it artificially. I think, instead, that we should
> send multiple seeky requests (possibly coming from different queues)
> at once. They will help especially with raid devices, where the seeks
> for requests going to different disks will happen in parallel.
>
Async is the prime offender, definitely.

> >> It could tie in with (or partly replace) the overload feature. The key
> >> to good latency and decent throughput is knowing when to allow queue
> >> build up and when not to.
> >
> > Hm.  Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> > build/unleash any sizable IO, but that's just my gut talking.
> >
> On the other hand, sending 1 write first and then waiting it to
> complete before submitting new ones, will help performing more merges,
> so the subsequent requests will be bigger and thus more efficient.

Usually async writes stack up very quickly, so as long as you don't
drain completely, the merging will happen automagically anyway.

-- 
Jens Axboe
* Re: IO scheduler based IO controller V10
  2009-10-03  9:00 ` Mike Galbraith
  2009-10-03  9:12   ` Corrado Zoccolo
@ 2009-10-03 13:17   ` Jens Axboe
  1 sibling, 0 replies; 349+ messages in thread
From: Jens Axboe @ 2009-10-03 13:17 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Sat, Oct 03 2009, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 09:24 +0200, Jens Axboe wrote:
>
> > After shutting down the computer yesterday, I was thinking a bit about
> > this issue and how to solve it without incurring too much delay. If we
> > add a stricter control of the depth, that may help. So instead of
> > allowing up to max_quantum (or larger) depths, only allow gradual build
> > up of that the farther we get away from a dispatch from the sync IO
> > queues. For example, when switching to an async or seeky sync queue,
> > initially allow just 1 in flight. For the next round, if there still
> > hasn't been sync activity, allow 2, then 4, etc. If we see sync IO queue
> > again, immediately drop to 1.
> >
> > It could tie in with (or partly replace) the overload feature. The key
> > to good latency and decent throughput is knowing when to allow queue
> > build up and when not to.
>
> Hm.  Starting at 1 sounds a bit thin (like IDLE), multiple iterations to
> build/unleash any sizable IO, but that's just my gut talking.

Not sure, will need some testing of course. But it'll build up quickly.

-- 
Jens Axboe
* Re: IO scheduler based IO controller V10
  2009-10-03  5:56 ` Mike Galbraith
@ 2009-10-03 11:29   ` Vivek Goyal
  2009-10-03 11:29   ` Vivek Goyal
       [not found]   ` <1254549378.8299.21.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-03 11:29 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Jens Axboe, Ingo Molnar, Linus Torvalds, Ulrich Lukas,
	linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew,
	fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir,
	righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

On Sat, Oct 03, 2009 at 07:56:18AM +0200, Mike Galbraith wrote:
> On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote:
> > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:
> >
> > > If you could do a cleaned up version of your overload patch based on
> > > this:
> > >
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
> > >
> > > then lets take it from there.
>
> Note to self: build the darn thing after last minute changes.
>
> Block:  Delay overloading of CFQ queues to improve read latency.
>
> Introduce a delay maximum dispatch timestamp, and stamp it when:
>         1. we encounter a known seeky or possibly new sync IO queue.
>         2. the current queue may go idle and we're draining async IO.
>         3. we have sync IO in flight and are servicing an async queue.
>         4. we are not the sole user of disk.
> Disallow exceeding quantum if any of these events have occurred recently.
>

So the primary issue seems to be that we have done a lot of dispatching
from the async queue, and if a sync queue comes in now, it will
experience high latency.

For an ongoing seeky sync queue, the issue will be solved to some extent:
previously we chose not to idle for that queue, and now we will idle, so
the async queue will not get a chance to overload the dispatch queue.

For the sync queues where we choose not to enable idling, we will still
see the latencies. Instead of time stamping on all the above events, can
we just keep track of when the last sync request completed in the system,
and not allow the async queue to flood/overload the dispatch queue within
a certain time limit of that completion? This would give the sync queue a
buffer period to come back and submit more requests without suffering
large latencies.

Thanks
Vivek

> Protect this behavioral change with a "desktop_dispatch" knob and default
> it to "on".. providing an easy means of regression verification prior to
> hate-mail dispatch :) to CC list.
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Cc: Jens Axboe <jens.axboe@oracle.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ... others who let somewhat hacky tweak slip by
>
> ---
>  block/cfq-iosched.c |   45 +++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 41 insertions(+), 4 deletions(-)
>
> Index: linux-2.6/block/cfq-iosched.c
> ===================================================================
> --- linux-2.6.orig/block/cfq-iosched.c
> +++ linux-2.6/block/cfq-iosched.c
> @@ -174,6 +174,9 @@ struct cfq_data {
>  	unsigned int cfq_slice_async_rq;
>  	unsigned int cfq_slice_idle;
>  	unsigned int cfq_desktop;
> +	unsigned int cfq_desktop_dispatch;
> +
> +	unsigned long desktop_dispatch_ts;
>  
>  	struct list_head cic_list;
>  
> @@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
>  	struct cfq_data *cfqd = q->elevator->elevator_data;
>  	struct cfq_queue *cfqq;
>  	unsigned int max_dispatch;
> +	unsigned long delay;
>  
>  	if (!cfqd->busy_queues)
>  		return 0;
> @@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
>  	/*
>  	 * Drain async requests before we start sync IO
>  	 */
> -	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
> +	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
> +		cfqd->desktop_dispatch_ts = jiffies;
>  		return 0;
> +	}
>  
>  	/*
>  	 * If this is an async queue and we have sync IO in flight, let it wait
>  	 */
> -	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
> +	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
> +		cfqd->desktop_dispatch_ts = jiffies;
>  		return 0;
> +	}
>  
>  	max_dispatch = cfqd->cfq_quantum;
>  	if (cfq_class_idle(cfqq))
>  		max_dispatch = 1;
>  
> +	if (cfqd->busy_queues > 1)
> +		cfqd->desktop_dispatch_ts = jiffies;
> +
>  	/*
>  	 * Does this cfqq already have too much IO in flight?
>  	 */
> @@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
>  			return 0;
>  
>  		/*
> +		 * Don't start overloading until we've been alone for a bit.
> +		 */
> +		if (cfqd->cfq_desktop_dispatch) {
> +			delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
> +
> +			if (time_before(jiffies, delay))
> +				return 0;
> +		}
> +
> +		/*
>  		 * we are the only queue, allow up to 4 times of 'quantum'
>  		 */
>  		if (cfqq->dispatched >= 4 * max_dispatch)
> @@ -1942,7 +1963,7 @@ static void
>  cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
>  		       struct cfq_io_context *cic)
>  {
> -	int old_idle, enable_idle;
> +	int old_idle, enable_idle, seeky = 0;
>  
>  	/*
>  	 * Don't idle for async or idle io prio class
> @@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
>  	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
>  		return;
>  
> +	if (cfqd->hw_tag) {
> +		if (CIC_SEEKY(cic))
> +			seeky = 1;
> +		/*
> +		 * If seeky or incalculable seekiness, delay overloading.
> +		 */
> +		if (seeky || !sample_valid(cic->seek_samples))
> +			cfqd->desktop_dispatch_ts = jiffies;
> +	}
> +
>  	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
>  
>  	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
> -	    (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
> +	    (!cfqd->cfq_desktop && seeky))
>  		enable_idle = 0;
>  	else if (sample_valid(cic->ttime_samples)) {
>  		if (cic->ttime_mean > cfqd->cfq_slice_idle)
> @@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
>  	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
>  	cfqd->cfq_slice_idle = cfq_slice_idle;
>  	cfqd->cfq_desktop = 1;
> +	cfqd->cfq_desktop_dispatch = 1;
> +
> +	cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
>  	cfqd->hw_tag = 1;
>  
>  	return cfqd;
> @@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
>  SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
>  SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
>  SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
> +SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
>  #undef SHOW_FUNCTION
>  
>  #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
> @@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
>  STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
>  		UINT_MAX, 0);
>  STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
> +STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
>  #undef STORE_FUNCTION
>  
>  #define CFQ_ATTR(name) \
> @@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
>  	CFQ_ATTR(slice_async_rq),
>  	CFQ_ATTR(slice_idle),
>  	CFQ_ATTR(desktop),
> +	CFQ_ATTR(desktop_dispatch),
>  	__ATTR_NULL
>  };
>  
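Vivek's alternative, a single timestamp taken at the last sync completion instead of stamping on several events, can be sketched as below. This is a userspace illustration under assumptions, not a patch against cfq-iosched.c: `struct sync_window`, `note_sync_completion()` and `async_may_overload()` are hypothetical names, ticks stand in for jiffies, and the length of the grace period is left to the caller. `tick_before()` uses the same signed-difference idiom as the kernel's `time_before()`, so the check keeps working when the counter wraps.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names throughout; this is not code from cfq-iosched.c. */
struct sync_window {
	unsigned long last_sync_done;	/* tick of the last sync completion */
	unsigned long grace;		/* ticks during which async stays within quantum */
	bool seen_sync;			/* false until the first sync completion */
};

/*
 * Wraparound-safe "a is earlier than b", using the signed-difference
 * idiom (relies on the usual two's-complement conversion to long).
 */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

static void sync_window_init(struct sync_window *w, unsigned long grace)
{
	assert(grace > 0);
	w->last_sync_done = 0;
	w->grace = grace;
	w->seen_sync = false;
}

/* Record that a sync request completed at tick 'now'. */
static void note_sync_completion(struct sync_window *w, unsigned long now)
{
	w->last_sync_done = now;
	w->seen_sync = true;
}

/*
 * True if an async queue may dispatch beyond the normal quantum at
 * tick 'now', i.e. the grace period after the last sync completion
 * has expired (or no sync request has completed yet).
 */
static bool async_may_overload(const struct sync_window *w, unsigned long now)
{
	if (!w->seen_sync)
		return true;
	return !tick_before(now, w->last_sync_done + w->grace);
}
```

The appeal of this shape is that there is one place to stamp (the sync completion path) and one place to check (before letting an async queue exceed quantum). What the grace period should be, and whether it covers the seeky and newly arrived sync queues that Mike's events 1 and 4 target, is left open here.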
* Re: IO scheduler based IO controller V10 @ 2009-10-03 11:29 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-03 11:29 UTC (permalink / raw) To: Mike Galbraith Cc: dhaval, peterz, dm-devel, dpshah, Jens Axboe, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, mikew, jmoyer, nauman, Ingo Molnar, m-ikeda, riel, lizf, fchecconi, containers, linux-kernel, akpm, righi.andrea, Linus Torvalds On Sat, Oct 03, 2009 at 07:56:18AM +0200, Mike Galbraith wrote: > On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote: > > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote: > > > > > If you could do a cleaned up version of your overload patch based on > > > this: > > > > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768 > > > > > > then lets take it from there. > > Note to self: build the darn thing after last minute changes. > > Block: Delay overloading of CFQ queues to improve read latency. > > Introduce a delay maximum dispatch timestamp, and stamp it when: > 1. we encounter a known seeky or possibly new sync IO queue. > 2. the current queue may go idle and we're draining async IO. > 3. we have sync IO in flight and are servicing an async queue. > 4 we are not the sole user of disk. > Disallow exceeding quantum if any of these events have occurred recently. > So it looks like primarily the issue seems to be that we done lot of dispatch from async queue and if some sync queue comes in now, it will experience latencies. For a ongoing seeky sync queue issue will be solved up to some extent because previously we did not choose to idle for that queue now we will idle, hence async queue will not get a chance to overload the dispatch queue. For the sync queues where we choose not to enable idle, we still will see the latencies. 
Instead of time stamping on all the above events, can we just keep track of last sync request completed in the system and don't allow async queue to flood/overload the dispatch queue with-in certain time limit of that last sync request completion. This just gives a buffer period to that sync queue to come back and submit more requests and still not suffer large latencies? Thanks Vivek > Protect this behavioral change with a "desktop_dispatch" knob and default > it to "on".. providing an easy means of regression verification prior to > hate-mail dispatch :) to CC list. > > Signed-off-by: Mike Galbraith <efault@gmx.de> > Cc: Jens Axboe <jens.axboe@oracle.com> > Cc: Linus Torvalds <torvalds@linux-foundation.org> > Cc: Andrew Morton <akpm@linux-foundation.org> > ... others who let somewhat hacky tweak slip by > > --- > block/cfq-iosched.c | 45 +++++++++++++++++++++++++++++++++++++++++---- > 1 file changed, 41 insertions(+), 4 deletions(-) > > Index: linux-2.6/block/cfq-iosched.c > =================================================================== > --- linux-2.6.orig/block/cfq-iosched.c > +++ linux-2.6/block/cfq-iosched.c > @@ -174,6 +174,9 @@ struct cfq_data { > unsigned int cfq_slice_async_rq; > unsigned int cfq_slice_idle; > unsigned int cfq_desktop; > + unsigned int cfq_desktop_dispatch; > + > + unsigned long desktop_dispatch_ts; > > struct list_head cic_list; > > @@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct > struct cfq_data *cfqd = q->elevator->elevator_data; > struct cfq_queue *cfqq; > unsigned int max_dispatch; > + unsigned long delay; > > if (!cfqd->busy_queues) > return 0; > @@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct > /* > * Drain async requests before we start sync IO > */ > - if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) > + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) { > + cfqd->desktop_dispatch_ts = jiffies; > return 0; > + } > > /* > * If this is an async queue and we 
have sync IO in flight, let it wait > */ > - if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) > + if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) { > + cfqd->desktop_dispatch_ts = jiffies; > return 0; > + } > > max_dispatch = cfqd->cfq_quantum; > if (cfq_class_idle(cfqq)) > max_dispatch = 1; > > + if (cfqd->busy_queues > 1) > + cfqd->desktop_dispatch_ts = jiffies; > + > /* > * Does this cfqq already have too much IO in flight? > */ > @@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct > return 0; > > /* > + * Don't start overloading until we've been alone for a bit. > + */ > + if (cfqd->cfq_desktop_dispatch) { > + delay = cfqd->desktop_dispatch_ts + cfq_slice_sync; > + > + if (time_before(jiffies, max_delay)) > + return 0; > + } > + > + /* > * we are the only queue, allow up to 4 times of 'quantum' > */ > if (cfqq->dispatched >= 4 * max_dispatch) > @@ -1942,7 +1963,7 @@ static void > cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, > struct cfq_io_context *cic) > { > - int old_idle, enable_idle; > + int old_idle, enable_idle, seeky = 0; > > /* > * Don't idle for async or idle io prio class > @@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data * > if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq)) > return; > > + if (cfqd->hw_tag) { > + if (CIC_SEEKY(cic)) > + seeky = 1; > + /* > + * If seeky or incalculable seekiness, delay overloading. 
> + */ > + if (seeky || !sample_valid(cic->seek_samples)) > + cfqd->desktop_dispatch_ts = jiffies; > + } > + > enable_idle = old_idle = cfq_cfqq_idle_window(cfqq); > > if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || > - (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic))) > + (!cfqd->cfq_desktop && seeky)) > enable_idle = 0; > else if (sample_valid(cic->ttime_samples)) { > if (cic->ttime_mean > cfqd->cfq_slice_idle) > @@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque > cfqd->cfq_slice_async_rq = cfq_slice_async_rq; > cfqd->cfq_slice_idle = cfq_slice_idle; > cfqd->cfq_desktop = 1; > + cfqd->cfq_desktop_dispatch = 1; > + > + cfqd->desktop_dispatch_ts = INITIAL_JIFFIES; > cfqd->hw_tag = 1; > > return cfqd; > @@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd- > SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1); > SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0); > SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0); > +SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0); > #undef SHOW_FUNCTION > > #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \ > @@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c > STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1, > UINT_MAX, 0); > STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0); > +STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0); > #undef STORE_FUNCTION > > #define CFQ_ATTR(name) \ > @@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] = > CFQ_ATTR(slice_async_rq), > CFQ_ATTR(slice_idle), > CFQ_ATTR(desktop), > + CFQ_ATTR(desktop_dispatch), > __ATTR_NULL > }; > > ^ permalink raw reply [flat|nested] 349+ messages in thread
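The alternative suggested in this message (remember when the last sync request completed, and keep async queues from exceeding their quantum until a grace period has passed) could be sketched roughly as below. This is only an illustration under stated assumptions: `sync_guard`, `tick_before`, `sync_guard_completed` and `async_may_overload` are hypothetical names that do not exist in cfq-iosched.c, plain tick counters stand in for jiffies, and the comparison only mirrors the wrap-safe idea behind the kernel's `time_before()`.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Wrap-safe "a is earlier than b" for a free-running tick counter.
 * Same idea as the kernel's time_before(); valid as long as the two
 * values are less than half the counter range apart.
 */
static bool tick_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical per-device state, not a real CFQ structure. */
struct sync_guard {
	unsigned long last_sync_done;	/* tick of the latest sync completion */
	unsigned long window;		/* grace period, in ticks */
};

static void sync_guard_init(struct sync_guard *g, unsigned long now,
			    unsigned long window)
{
	/* The wrap-safe comparison breaks down for larger windows. */
	assert(window <= ~0UL / 2);
	g->last_sync_done = now - window;	/* start with the window expired */
	g->window = window;
}

/* Call whenever a sync request completes. */
static void sync_guard_completed(struct sync_guard *g, unsigned long now)
{
	g->last_sync_done = now;
}

/*
 * True if an async queue may go beyond its quantum, i.e. no sync
 * request has completed within the last 'window' ticks.
 */
static bool async_may_overload(const struct sync_guard *g, unsigned long now)
{
	return !tick_before(now, g->last_sync_done + g->window);
}
```

Compared with stamping on four separate events, this keeps a single stamping site (sync completion), at the cost of not reacting to queues that have not completed anything yet.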
* Re: IO scheduler based IO controller V10 [not found] ` <1254549378.8299.21.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> @ 2009-10-03 7:24 ` Jens Axboe 2009-10-03 11:29 ` Vivek Goyal 1 sibling, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-03 7:24 UTC (permalink / raw) To: Mike Galbraith Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds On Sat, Oct 03 2009, Mike Galbraith wrote: > On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote: > > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote: > > > > > If you could do a cleaned up version of your overload patch based on > > > this: > > > > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768 > > > > > > then lets take it from there. > > Note to self: build the darn thing after last minute changes. > > Block: Delay overloading of CFQ queues to improve read latency. > > Introduce a delay maximum dispatch timestamp, and stamp it when: > 1. we encounter a known seeky or possibly new sync IO queue. > 2. the current queue may go idle and we're draining async IO. > 3. we have sync IO in flight and are servicing an async queue. > 4 we are not the sole user of disk. > Disallow exceeding quantum if any of these events have occurred recently. > > Protect this behavioral change with a "desktop_dispatch" knob and default > it to "on".. providing an easy means of regression verification prior to > hate-mail dispatch :) to CC list. 
It still doesn't build: block/cfq-iosched.c: In function 'cfq_dispatch_requests': block/cfq-iosched.c:1345: error: 'max_delay' undeclared (first use in this function) After shutting down the computer yesterday, I was thinking a bit about this issue and how to solve it without incurring too much delay. If we add a stricter control of the depth, that may help. So instead of allowing up to max_quantum (or larger) depths, only allow a gradual build-up of that depth the farther we get away from the last dispatch from the sync IO queues. For example, when switching to an async or seeky sync queue, initially allow just 1 in flight. For the next round, if there still hasn't been sync activity, allow 2, then 4, etc. If we see a sync IO queue again, immediately drop to 1. It could tie in with (or partly replace) the overload feature. The key to good latency and decent throughput is knowing when to allow queue build up and when not to. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
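The gradual depth build-up described in this message (1 in flight, then 2, then 4, and back to 1 as soon as sync IO shows up) might look something like the sketch below. It is not taken from any CFQ patch: `depth_ramp`, `depth_ramp_init` and `depth_ramp_next` are hypothetical names, and the notion of a "round" and the upper bound `max_depth` are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical ramp state, not a real CFQ structure. */
struct depth_ramp {
	unsigned int depth;	/* depth to allow in the coming round */
	unsigned int max_depth;	/* upper bound, e.g. some multiple of quantum */
};

static void depth_ramp_init(struct depth_ramp *r, unsigned int max_depth)
{
	assert(max_depth >= 1);
	r->depth = 1;
	r->max_depth = max_depth;
}

/*
 * Call once per dispatch round for an async or seeky sync queue.
 * saw_sync_io: a sync queue was active since the previous round.
 * Returns the number of requests allowed in flight for this round.
 */
static unsigned int depth_ramp_next(struct depth_ramp *r, bool saw_sync_io)
{
	unsigned int allowed;

	/* Any sync activity drops us straight back to a depth of 1. */
	if (saw_sync_io)
		r->depth = 1;

	allowed = r->depth;

	/* Double for the next round, clamped without risking overflow. */
	if (r->depth > r->max_depth / 2)
		r->depth = r->max_depth;
	else
		r->depth *= 2;

	return allowed;
}
```

With `max_depth` set to 8, successive quiet rounds would allow 1, 2, 4, 8, 8, ... requests, and a round that saw sync IO allows 1 again before the ramp restarts.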
* Re: IO scheduler based IO controller V10 [not found] ` <1254548931.8299.18.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> @ 2009-10-03 5:56 ` Mike Galbraith 2009-10-03 7:20 ` Ingo Molnar 1 sibling, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-03 5:56 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds On Sat, 2009-10-03 at 07:49 +0200, Mike Galbraith wrote: > On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote: > > > If you could do a cleaned up version of your overload patch based on > > this: > > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768 > > > > then lets take it from there. Note to self: build the darn thing after last minute changes. Block: Delay overloading of CFQ queues to improve read latency. Introduce a delay maximum dispatch timestamp, and stamp it when: 1. we encounter a known seeky or possibly new sync IO queue. 2. the current queue may go idle and we're draining async IO. 3. we have sync IO in flight and are servicing an async queue. 4 we are not the sole user of disk. Disallow exceeding quantum if any of these events have occurred recently. Protect this behavioral change with a "desktop_dispatch" knob and default it to "on".. providing an easy means of regression verification prior to hate-mail dispatch :) to CC list. 
Signed-off-by: Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org> Cc: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> Cc: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> ... others who let somewhat hacky tweak slip by --- block/cfq-iosched.c | 45 +++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 41 insertions(+), 4 deletions(-) Index: linux-2.6/block/cfq-iosched.c =================================================================== --- linux-2.6.orig/block/cfq-iosched.c +++ linux-2.6/block/cfq-iosched.c @@ -174,6 +174,9 @@ struct cfq_data { unsigned int cfq_slice_async_rq; unsigned int cfq_slice_idle; unsigned int cfq_desktop; + unsigned int cfq_desktop_dispatch; + + unsigned long desktop_dispatch_ts; struct list_head cic_list; @@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct struct cfq_data *cfqd = q->elevator->elevator_data; struct cfq_queue *cfqq; unsigned int max_dispatch; + unsigned long delay; if (!cfqd->busy_queues) return 0; @@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct /* * Drain async requests before we start sync IO */ - if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) { + cfqd->desktop_dispatch_ts = jiffies; return 0; + } /* * If this is an async queue and we have sync IO in flight, let it wait */ - if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) + if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) { + cfqd->desktop_dispatch_ts = jiffies; return 0; + } max_dispatch = cfqd->cfq_quantum; if (cfq_class_idle(cfqq)) max_dispatch = 1; + if (cfqd->busy_queues > 1) + cfqd->desktop_dispatch_ts = jiffies; + /* * Does this cfqq already have too much IO in flight? */ @@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct return 0; /* + * Don't start overloading until we've been alone for a bit. 
+ */ + if (cfqd->cfq_desktop_dispatch) { + delay = cfqd->desktop_dispatch_ts + cfq_slice_sync; + + if (time_before(jiffies, max_delay)) + return 0; + } + + /* * we are the only queue, allow up to 4 times of 'quantum' */ if (cfqq->dispatched >= 4 * max_dispatch) @@ -1942,7 +1963,7 @@ static void cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq, struct cfq_io_context *cic) { - int old_idle, enable_idle; + int old_idle, enable_idle, seeky = 0; /* * Don't idle for async or idle io prio class @@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data * if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq)) return; + if (cfqd->hw_tag) { + if (CIC_SEEKY(cic)) + seeky = 1; + /* + * If seeky or incalculable seekiness, delay overloading. + */ + if (seeky || !sample_valid(cic->seek_samples)) + cfqd->desktop_dispatch_ts = jiffies; + } + enable_idle = old_idle = cfq_cfqq_idle_window(cfqq); if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle || - (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic))) + (!cfqd->cfq_desktop && seeky)) enable_idle = 0; else if (sample_valid(cic->ttime_samples)) { if (cic->ttime_mean > cfqd->cfq_slice_idle) @@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque cfqd->cfq_slice_async_rq = cfq_slice_async_rq; cfqd->cfq_slice_idle = cfq_slice_idle; cfqd->cfq_desktop = 1; + cfqd->cfq_desktop_dispatch = 1; + + cfqd->desktop_dispatch_ts = INITIAL_JIFFIES; cfqd->hw_tag = 1; return cfqd; @@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd- SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1); SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0); SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0); +SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0); #undef SHOW_FUNCTION #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \ @@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1, 
UINT_MAX, 0); STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0); +STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0); #undef STORE_FUNCTION #define CFQ_ATTR(name) \ @@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] = CFQ_ATTR(slice_async_rq), CFQ_ATTR(slice_idle), CFQ_ATTR(desktop), + CFQ_ATTR(desktop_dispatch), __ATTR_NULL }; ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <1254548931.8299.18.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-03 5:56 ` Mike Galbraith @ 2009-10-03 7:20 ` Ingo Molnar 1 sibling, 0 replies; 349+ messages in thread From: Ingo Molnar @ 2009-10-03 7:20 UTC (permalink / raw) To: Mike Galbraith Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds * Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org> wrote: > unsigned int cfq_desktop; > + unsigned int cfq_desktop_dispatch; > - if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) > + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) { > + cfqd->desktop_dispatch_ts = jiffies; > return 0; > + } btw., i hope all those desktop_ things will be named latency_ pretty soon as the consensus seems to be - the word 'desktop' feels so wrong in this context. 'desktop' is a form of use of computers and the implication of good latencies goes far beyond that category of systems. Ingo ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <20091003072021.GB21407-X9Un+BFzKDI@public.gmane.org> @ 2009-10-03 7:25 ` Jens Axboe 0 siblings, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-03 7:25 UTC (permalink / raw) To: Ingo Molnar Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds On Sat, Oct 03 2009, Ingo Molnar wrote: > > * Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org> wrote: > > > unsigned int cfq_desktop; > > + unsigned int cfq_desktop_dispatch; > > > - if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) > > + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) { > > + cfqd->desktop_dispatch_ts = jiffies; > > return 0; > > + } > > btw., i hope all those desktop_ things will be named latency_ pretty > soon as the consensus seems to be - the word 'desktop' feels so wrong in > this context. > > 'desktop' is a form of use of computers and the implication of good > latencies goes far beyond that category of systems. I will rename it, for now it doesn't matter (lets not get bogged down in bike shed colors, please). Oh and Mike, I forgot to mention this in the previous email - no more tunables, please. We'll keep this under a single knob. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-03 7:25 ` Jens Axboe (?) @ 2009-10-03 8:53 ` Mike Galbraith -1 siblings, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-03 8:53 UTC (permalink / raw) To: Jens Axboe Cc: Ingo Molnar, Linus Torvalds, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Sat, 2009-10-03 at 09:25 +0200, Jens Axboe wrote: > On Sat, Oct 03 2009, Ingo Molnar wrote: > Oh and Mike, I forgot to mention this in the previous email - no more > tunables, please. We'll keep this under a single knob. OK. Since I don't seem to be competent to operate quilt this morning anyway, I won't send a fixed version yet. Anyone who wants to test can easily fix the rename booboo. With the knob in place, it's easier to see what load is affected by what change. Back to rummage/test. -Mike ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <20091003072540.GW31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-03 8:53 ` Mike Galbraith @ 2009-10-03 9:01 ` Corrado Zoccolo 1 sibling, 0 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-10-03 9:01 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds Hi Jens, On Sat, Oct 3, 2009 at 9:25 AM, Jens Axboe <jens.axboe@oracle.com> wrote: > On Sat, Oct 03 2009, Ingo Molnar wrote: >> >> * Mike Galbraith <efault@gmx.de> wrote: >> >> > unsigned int cfq_desktop; >> > + unsigned int cfq_desktop_dispatch; >> >> > - if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) >> > + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) { >> > + cfqd->desktop_dispatch_ts = jiffies; >> > return 0; >> > + } >> >> btw., i hope all those desktop_ things will be named latency_ pretty >> soon as the consensus seems to be - the word 'desktop' feels so wrong in >> this context. >> >> 'desktop' is a form of use of computers and the implication of good >> latencies goes far beyond that category of systems. > > I will rename it, for now it doesn't matter (lets not get bogged down in > bike shed colors, please). > > Oh and Mike, I forgot to mention this in the previous email - no more > tunables, please. We'll keep this under a single knob. Did you have a look at my http://patchwork.kernel.org/patch/47750/ ? 
It already introduces a 'target_latency' tunable, expressed in ms. If we can quantify the benefits of each technique, we could enable them based on the target latency requested by that single tunable. Corrado > > -- > Jens Axboe > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091002181903.GN31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091002181903.GN31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> @ 2009-10-02 18:57 ` Mike Galbraith 2009-10-03 5:48 ` Mike Galbraith 1 sibling, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-02 18:57 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote: > I'm not too worried about the "single IO producer" scenarios, and it > looks like (from a quick look) that most of your numbers are within some > expected noise levels. It's the more complex mixes that are likely to > cause a bit of a stink, but lets worry about that later. One quick thing > would be to read eg 2 or more files sequentially from disk and see how > that performs. Hm. git(s) should be good for a nice repeatable load. Suggestions? > If you could do a cleaned up version of your overload patch based on > this: > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768 > > then lets take it from there. I'll try to find a good repeatable git beater first. At this point, I only know it helps with one load. -Mike ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <20091002181903.GN31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 18:57 ` Mike Galbraith @ 2009-10-03 5:48 ` Mike Galbraith 1 sibling, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-03 5:48 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds

On Fri, 2009-10-02 at 20:19 +0200, Jens Axboe wrote:

> If you could do a cleaned up version of your overload patch based on
> this:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768
>
> then lets take it from there.

If take it from there ends up meaning apply, and see who squeaks, feel
free to delete the "Not", and my somewhat defective sense of humor.

Block: Delay overloading of CFQ queues to improve read latency.

Introduce a timestamp that delays maximum dispatch, and stamp it when:
 1. we encounter a known seeky or possibly new sync IO queue.
 2. the current queue may go idle and we're draining async IO.
 3. we have sync IO in flight and are servicing an async queue.
 4. we are not the sole user of disk.
Disallow exceeding quantum if any of these events have occurred recently.

Protect this behavioral change with a "desktop_dispatch" knob and default
it to "on", providing an easy means of regression verification prior to
hate-mail dispatch :) to CC list.

Signed-off-by: Mike Galbraith <efault-Mmb7MZpHnFY@public.gmane.org>
Cc: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Linus Torvalds <torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
Cc: Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
... others who let somewhat hacky tweak slip by
LKML-Reference: <new-submission>

---
 block/cfq-iosched.c |   45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -174,6 +174,9 @@ struct cfq_data {
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
 	unsigned int cfq_desktop;
+	unsigned int cfq_desktop_dispatch;
+
+	unsigned long desktop_dispatch_ts;
 
 	struct list_head cic_list;
 
@@ -1283,6 +1286,7 @@ static int cfq_dispatch_requests(struct
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 	struct cfq_queue *cfqq;
 	unsigned int max_dispatch;
+	unsigned long delay;
 
 	if (!cfqd->busy_queues)
 		return 0;
@@ -1297,19 +1301,26 @@ static int cfq_dispatch_requests(struct
 	/*
 	 * Drain async requests before we start sync IO
 	 */
-	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
-	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
+	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) {
+		cfqd->desktop_dispatch_ts = jiffies;
 		return 0;
+	}
 
 	max_dispatch = cfqd->cfq_quantum;
 	if (cfq_class_idle(cfqq))
 		max_dispatch = 1;
 
+	if (cfqd->busy_queues > 1)
+		cfqd->desktop_dispatch_ts = jiffies;
+
 	/*
 	 * Does this cfqq already have too much IO in flight?
 	 */
@@ -1327,6 +1338,16 @@ static int cfq_dispatch_requests(struct
 			return 0;
 
 		/*
+		 * Don't start overloading until we've been alone for a bit.
+		 */
+		if (cfqd->cfq_desktop_dispatch) {
+			delay = cfqd->desktop_dispatch_ts + cfq_slice_sync;
+
+			if (time_before(jiffies, delay))
+				return 0;
+		}
+
+		/*
 		 * we are the only queue, allow up to 4 times of 'quantum'
 		 */
 		if (cfqq->dispatched >= 4 * max_dispatch)
@@ -1942,7 +1963,7 @@ static void
 cfq_update_idle_window(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 		       struct cfq_io_context *cic)
 {
-	int old_idle, enable_idle;
+	int old_idle, enable_idle, seeky = 0;
 
 	/*
 	 * Don't idle for async or idle io prio class
@@ -1950,10 +1971,20 @@ cfq_update_idle_window(struct cfq_data *
 	if (!cfq_cfqq_sync(cfqq) || cfq_class_idle(cfqq))
 		return;
 
+	if (cfqd->hw_tag) {
+		if (CIC_SEEKY(cic))
+			seeky = 1;
+		/*
+		 * If seeky or incalculable seekiness, delay overloading.
+		 */
+		if (seeky || !sample_valid(cic->seek_samples))
+			cfqd->desktop_dispatch_ts = jiffies;
+	}
+
 	enable_idle = old_idle = cfq_cfqq_idle_window(cfqq);
 
 	if (!atomic_read(&cic->ioc->nr_tasks) || !cfqd->cfq_slice_idle ||
-	    (!cfqd->cfq_desktop && cfqd->hw_tag && CIC_SEEKY(cic)))
+	    (!cfqd->cfq_desktop && seeky))
 		enable_idle = 0;
 	else if (sample_valid(cic->ttime_samples)) {
 		if (cic->ttime_mean > cfqd->cfq_slice_idle)
@@ -2483,6 +2514,9 @@ static void *cfq_init_queue(struct reque
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
 	cfqd->cfq_desktop = 1;
+	cfqd->cfq_desktop_dispatch = 1;
+
+	cfqd->desktop_dispatch_ts = INITIAL_JIFFIES;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2553,6 +2587,7 @@ SHOW_FUNCTION(cfq_slice_sync_show, cfqd-
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
 SHOW_FUNCTION(cfq_desktop_show, cfqd->cfq_desktop, 0);
+SHOW_FUNCTION(cfq_desktop_dispatch_show, cfqd->cfq_desktop_dispatch, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2585,6 +2620,7 @@ STORE_FUNCTION(cfq_slice_async_store, &c
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
 STORE_FUNCTION(cfq_desktop_store, &cfqd->cfq_desktop, 0, 1, 0);
+STORE_FUNCTION(cfq_desktop_dispatch_store, &cfqd->cfq_desktop_dispatch, 0, 1, 0);
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2601,6 +2637,7 @@ static struct elv_fs_entry cfq_attrs[] =
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
 	CFQ_ATTR(desktop),
+	CFQ_ATTR(desktop_dispatch),
 	__ATTR_NULL
 };

^ permalink raw reply [flat|nested] 349+ messages in thread
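[Editorial sketch, not part of the posted patch.] The gating idea in the patch above can be tried outside the kernel. The following is a minimal userspace model, assuming a free-running tick counter in place of jiffies; struct overload_gate, gate_stamp(), gate_may_overload() and ticks_before() are hypothetical names chosen for illustration, and only the wrap-safe comparison mirrors what the kernel's time_before() does.

```c
#include <stdbool.h>

/*
 * Wrap-safe "a is earlier than b" for a free-running counter.
 * Same idea as the kernel's time_before(); this is a userspace stand-in.
 */
static bool ticks_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical stand-in for the fields the patch adds to struct cfq_data. */
struct overload_gate {
	bool enabled;         /* models the cfq_desktop_dispatch knob */
	unsigned long stamp;  /* models desktop_dispatch_ts */
	unsigned long quiet;  /* models cfq_slice_sync, in ticks */
};

/* Record an event that means "this queue is not alone on the disk". */
static void gate_stamp(struct overload_gate *g, unsigned long now)
{
	g->stamp = now;
}

/*
 * May the only busy queue exceed its quantum at time 'now'?
 * With the knob off the answer is always yes. With it on, the queue
 * must have been undisturbed for at least 'quiet' ticks.
 */
static bool gate_may_overload(const struct overload_gate *g, unsigned long now)
{
	if (!g->enabled)
		return true;
	return !ticks_before(now, g->stamp + g->quiet);
}
```

A caller would invoke gate_stamp() at each of the four events listed in the changelog and consult gate_may_overload() before allowing more than one quantum in flight. The unsigned subtraction keeps the check correct when the counter wraps, which is why a plain "now < stamp + quiet" comparison is not used.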
[parent not found: <1254507215.8667.7.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <1254507215.8667.7.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> @ 2009-10-02 18:19 ` Jens Axboe 0 siblings, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 18:19 UTC (permalink / raw) To: Mike Galbraith Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds On Fri, Oct 02 2009, Mike Galbraith wrote: > On Fri, 2009-10-02 at 19:37 +0200, Jens Axboe wrote: > > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > > > > > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > > > > > > > > > It's not _that_ easy, it depends a lot on the access patterns. A > > > > > > good example of that is actually the idling that we already do. > > > > > > Say you have two applications, each starting up. If you start them > > > > > > both at the same time and just care for the dumb low latency, then > > > > > > you'll do one IO from each of them in turn. Latency will be good, > > > > > > but throughput will be awful. And this means that in 20s they are > > > > > > both started, while with the slice idling and priority disk access > > > > > > that CFQ does, you'd hopefully have both up and running in 2s. > > > > > > > > > > > > So latency is good, definitely, but sometimes you have to worry > > > > > > about the bigger picture too. 
Latency is more than single IOs, > > > > > > it's often for complete operation which may involve lots of IOs. > > > > > > Single IO latency is a benchmark thing, it's not a real life > > > > > > issue. And that's where it becomes complex and not so black and > > > > > > white. Mike's test is a really good example of that. > > > > > > > > > > To the extent of you arguing that Mike's test is artificial (i'm not > > > > > sure you are arguing that) - Mike certainly did not do an artificial > > > > > test - he tested 'konsole' cache-cold startup latency, such as: > > > > > > > > [snip] > > > > > > > > I was saying the exact opposite, that Mike's test is a good example of > > > > a valid test. It's not measuring single IO latencies, it's doing a > > > > sequence of valid events and looking at the latency for those. It's > > > > benchmarking the bigger picture, not a microbenchmark. > > > > > > Good, so we are in violent agreement :-) > > > > Yes, perhaps that last sentence didn't provide enough evidence of which > > category I put Mike's test into :-) > > > > So to kick things off, I added an 'interactive' knob to CFQ and > > defaulted it to on, along with re-enabling slice idling for hardware > > that does tagged command queuing. This is almost completely identical to > > what Vivek Goyal originally posted, it's just combined into one and uses > > the term 'interactive' instead of 'fairness'. I think the former is a > > better umbrella under which to add further tweaks that may sacrifice > > throughput slightly, in the quest for better latency. > > > > It's queued up in the for-linus branch. > > FWIW, I did a matrix of Vivek's patch combined with my hack. Seems we > do lose a bit of dd throughput over stock with either or both. 
>
> dd pre         65.1    65.4    67.5    64.8    65.1    65.5   fairness=1 overload_delay=1
> perf stat      1.70    1.94    1.32    1.89    1.87     1.7
> dd post        69.4    62.3    69.7    70.3    69.6    68.2
>
> dd pre         67.0    67.8    64.7    64.7    64.9    65.8   fairness=1 overload_delay=0
> perf stat      4.89    3.13    2.98    2.71    2.17     3.1
> dd post        67.2    63.3    62.6    62.8    63.1    63.8
>
> dd pre         65.0    66.0    66.9    64.6    67.0    65.9   fairness=0 overload_delay=1
> perf stat      4.66    3.81    4.23    2.98    4.23     3.9
> dd post        62.0    60.8    62.4    61.4    62.2    61.7
>
> dd pre         65.3    65.6    64.9    69.5    65.8    66.2   fairness=0 overload_delay=0
> perf stat     14.79    9.11   14.16    8.44   13.67    12.0
> dd post        64.1    66.5    64.0    66.5    64.4    65.1

I'm not too worried about the "single IO producer" scenarios, and it
looks like (from a quick look) that most of your numbers are within some
expected noise levels. It's the more complex mixes that are likely to
cause a bit of a stink, but lets worry about that later. One quick thing
would be to read eg 2 or more files sequentially from disk and see how
that performs.

If you could do a cleaned up version of your overload patch based on
this:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=1d2235152dc745c6d94bedb550fea84cffdbf768

then lets take it from there.

--
Jens Axboe

^ permalink raw reply [flat|nested] 349+ messages in thread
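[Editorial sketch, not part of the thread.] The last column of each row in Mike's matrix appears to be the mean of the five runs before it. A small helper makes the comparison between configurations explicit. The sample values are copied from the perf stat rows above (konsole startup time, presumably in seconds); sample_mean() and the array names are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* perf stat samples from the matrix above; units are presumably seconds. */
static const double perf_both[]       = { 1.70, 1.94, 1.32, 1.89, 1.87 };    /* fairness=1 overload_delay=1 */
static const double perf_fair_only[]  = { 4.89, 3.13, 2.98, 2.71, 2.17 };    /* fairness=1 overload_delay=0 */
static const double perf_delay_only[] = { 4.66, 3.81, 4.23, 2.98, 4.23 };    /* fairness=0 overload_delay=1 */
static const double perf_stock[]      = { 14.79, 9.11, 14.16, 8.44, 13.67 }; /* fairness=0 overload_delay=0 */

/* Arithmetic mean of n samples; n must be non-zero. */
static double sample_mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}
```

With both tweaks enabled the mean is about 1.74, against about 12.03 with neither, so startup is roughly 6.9 times faster in this one workload. The dd rows move by only a few percent in either direction, which fits Jens's remark that most of the numbers are within expected noise.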
[parent not found: <20091002172554.GJ31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091002172554.GJ31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> @ 2009-10-02 17:28 ` Ingo Molnar 0 siblings, 0 replies; 349+ messages in thread From: Ingo Molnar @ 2009-10-02 17:28 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > On Fri, Oct 02 2009, Ingo Molnar wrote: > > > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > > > It's not _that_ easy, it depends a lot on the access patterns. A > > > good example of that is actually the idling that we already do. > > > Say you have two applications, each starting up. If you start them > > > both at the same time and just care for the dumb low latency, then > > > you'll do one IO from each of them in turn. Latency will be good, > > > but throughput will be awful. And this means that in 20s they are > > > both started, while with the slice idling and priority disk access > > > that CFQ does, you'd hopefully have both up and running in 2s. > > > > > > So latency is good, definitely, but sometimes you have to worry > > > about the bigger picture too. Latency is more than single IOs, > > > it's often for complete operation which may involve lots of IOs. > > > Single IO latency is a benchmark thing, it's not a real life > > > issue. And that's where it becomes complex and not so black and > > > white. 
Mike's test is a really good example of that. > > > > To the extent of you arguing that Mike's test is artificial (i'm not > > sure you are arguing that) - Mike certainly did not do an artificial > > test - he tested 'konsole' cache-cold startup latency, such as: > > [snip] > > I was saying the exact opposite, that Mike's test is a good example of > a valid test. It's not measuring single IO latencies, it's doing a > sequence of valid events and looking at the latency for those. It's > benchmarking the bigger picture, not a microbenchmark. Good, so we are in violent agreement :-) Ingo ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091002172046.GA2376-X9Un+BFzKDI@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091002172046.GA2376-X9Un+BFzKDI@public.gmane.org> @ 2009-10-02 17:25 ` Jens Axboe 0 siblings, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 17:25 UTC (permalink / raw) To: Ingo Molnar Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds On Fri, Oct 02 2009, Ingo Molnar wrote: > > * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > It's not _that_ easy, it depends a lot on the access patterns. A good > > example of that is actually the idling that we already do. Say you > > have two applications, each starting up. If you start them both at the > > same time and just care for the dumb low latency, then you'll do one > > IO from each of them in turn. Latency will be good, but throughput > > will be awful. And this means that in 20s they are both started, > > while with the slice idling and priority disk access that CFQ does, > > you'd hopefully have both up and running in 2s. > > > > So latency is good, definitely, but sometimes you have to worry about > > the bigger picture too. Latency is more than single IOs, it's often > > for complete operation which may involve lots of IOs. Single IO > > latency is a benchmark thing, it's not a real life issue. And that's > > where it becomes complex and not so black and white. Mike's test is a > > really good example of that. 
> > To the extent of you arguing that Mike's test is artificial (i'm not > sure you are arguing that) - Mike certainly did not do an artificial > test - he tested 'konsole' cache-cold startup latency, such as: [snip] I was saying the exact opposite, that Mike's test is a good example of a valid test. It's not measuring single IO latencies, it's doing a sequence of valid events and looking at the latency for those. It's benchmarking the bigger picture, not a microbenchmark. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091002171129.GG31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091002171129.GG31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> @ 2009-10-02 17:20 ` Ingo Molnar 0 siblings, 0 replies; 349+ messages in thread From: Ingo Molnar @ 2009-10-02 17:20 UTC (permalink / raw) To: Jens Axboe Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, Linus Torvalds * Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > It's not _that_ easy, it depends a lot on the access patterns. A good > example of that is actually the idling that we already do. Say you > have two applications, each starting up. If you start them both at the > same time and just care for the dumb low latency, then you'll do one > IO from each of them in turn. Latency will be good, but throughput > will be awful. And this means that in 20s they are both started, > while with the slice idling and priority disk access that CFQ does, > you'd hopefully have both up and running in 2s. > > So latency is good, definitely, but sometimes you have to worry about > the bigger picture too. Latency is more than single IOs, it's often > for complete operation which may involve lots of IOs. Single IO > latency is a benchmark thing, it's not a real life issue. And that's > where it becomes complex and not so black and white. Mike's test is a > really good example of that. 
To the extent of you arguing that Mike's test is artificial (i'm not sure you are arguing that) - Mike certainly did not do an artificial test - he tested 'konsole' cache-cold startup latency, such as: sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE against a streaming dd. That is a _very_ relevant benchmark IMHO and konsole's cache footprint is far from trivial. (In fact i'd argue it's one of the most important IO benchmarks on a desktop system - how does your desktop hold up to something doing streaming IO.) Ingo ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <alpine.LFD.2.01.0910020811490.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <alpine.LFD.2.01.0910020811490.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2009-10-02 16:01 ` jim owens 2009-10-02 17:11 ` Jens Axboe 1 sibling, 0 replies; 349+ messages in thread From: jim owens @ 2009-10-02 16:01 UTC (permalink / raw) To: Linus Torvalds Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w Linus Torvalds wrote: > > I really think we should do latency first, and throughput second. Agree. > It's _easy_ to get throughput. The people who care just about throughput > can always just disable all the work we do for latency. But in my experience it is not that simple... The argument latency vs throughput or desktop vs server is wrong. I/O can never keep up with the ability of CPUs to dirty data. On desktops and servers (really many-user-desktops) we want minimum latency but the enemy is dirty VM. If we ignore the need for throughput to flush dirty pages, VM gets angry and forced VM page cleaning I/O is bad I/O. We want min latency with low dirty page percent but need to switch to max write throughput at some high dirty page percent. We can not prevent the cliff we fall off where the system chokes because the dirty page load is too high, but if we only worry about latency, we bring that choke point cliff in so it happens with a lower load. 
A 10% lower overload point might be fine to get 100% better latency, but would desktop users accept a 50% lower overload point where running one more application makes the system appear hung? Even desktop users commonly measure "how much work can I do before the system becomes unresponsive". jim ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <alpine.LFD.2.01.0910020811490.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2009-10-02 16:01 ` jim owens @ 2009-10-02 17:11 ` Jens Axboe 1 sibling, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 17:11 UTC (permalink / raw) To: Linus Torvalds Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, Ingo Molnar, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, Mike Galbraith, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w On Fri, Oct 02 2009, Linus Torvalds wrote: > > > On Fri, 2 Oct 2009, Jens Axboe wrote: > > > > Mostly they care about throughput, and when they come running because > > some of their favorite app/benchmark/etc is now 2% slower, I get to hear > > about it all the time. So yes, latency is not ignored, but mostly they > > yack about throughput. > > The reason they yack about it is that they can measure it. > > Give them the benchmark where it goes the other way, and tell them why > they see a 2% deprovement. Give them some button they can tweak, because > they will. To some extent that's true, and I didn't want to generalize. If they are adamant that the benchmark models their real life, then no amount of pointing in the other direction will change that. Your point about tuning is definitely true, these people are used to tuning things. For the desktop we care a lot more about working out of the box. > But make the default be low-latency. Because everybody cares about low > latency, and the people who do so are _not_ the people who you give > buttons to tweak things with. Totally agree. 
> > I agree, we can easily make CFQ be very much about latency. If you > > think that is fine, then lets just do that. Then we'll get to fix the > > server side up when the next RHEL/SLES/whatever cycle is honing in on a > > kernel, hopefully we won't have to start over when that happens. > > I really think we should do latency first, and throughput second. > > It's _easy_ to get throughput. The people who care just about throughput > can always just disable all the work we do for latency. If they really > care about just throughput, they won't want fairness either - none of that > complex stuff. It's not _that_ easy, it depends a lot on the access patterns. A good example of that is actually the idling that we already do. Say you have two applications, each starting up. If you start them both at the same time and just care for the dumb low latency, then you'll do one IO from each of them in turn. Latency will be good, but throughput will be awful. And this means that in 20s they are both started, while with the slice idling and priority disk access that CFQ does, you'd hopefully have both up and running in 2s. So latency is good, definitely, but sometimes you have to worry about the bigger picture too. Latency is more than single IOs, it's often for complete operation which may involve lots of IOs. Single IO latency is a benchmark thing, it's not a real life issue. And that's where it becomes complex and not so black and white. Mike's test is a really good example of that. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
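[Editorial sketch, not part of the thread.] Jens's two-application example can be put into a toy cost model. The assumptions here are loud ones: each switch of the disk head from one application's data to the other's costs one fixed seek, every request costs a fixed transfer time, and nothing else matters. startup_time_us() and its parameters are invented for this illustration and do not correspond to anything in CFQ.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model: two applications each issue 'ios_per_app' reads. The disk
 * serves 'batch' requests from one application before switching to the
 * other, and every switch costs one seek of 'seek_us' microseconds.
 * Each request also costs 'xfer_us' microseconds of transfer time.
 * Returns the time until both applications have finished, in microseconds.
 */
static uint64_t startup_time_us(uint64_t ios_per_app, uint64_t batch,
				uint64_t seek_us, uint64_t xfer_us)
{
	uint64_t batches_per_app;
	uint64_t seeks;

	assert(batch > 0);
	batches_per_app = (ios_per_app + batch - 1) / batch; /* round up */
	seeks = 2 * batches_per_app;                         /* one seek per batch */
	return seeks * seek_us + 2 * ios_per_app * xfer_us;
}
```

With 1000 requests per application, an 8 ms seek and 1 ms per transfer, strict alternation (batch of 1) takes 18 s in this model, while batches of 100 take 2.16 s. The constants are arbitrary, but the order of magnitude matches the 20 s versus 2 s contrast Jens describes: per-request alternation gives each single IO a short wait and makes the whole operation far slower.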
* Re: IO scheduler based IO controller V10 2009-10-02 14:56 ` Jens Axboe (?) (?) @ 2009-10-02 16:33 ` Ray Lee 2009-10-02 17:13 ` Jens Axboe [not found] ` <2c0942db0910020933l6d312c6ahae0e00619f598b39-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> -1 siblings, 2 replies; 349+ messages in thread From: Ray Lee @ 2009-10-02 16:33 UTC (permalink / raw) To: Jens Axboe Cc: Linus Torvalds, Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, Oct 2, 2009 at 7:56 AM, Jens Axboe <jens.axboe@oracle.com> wrote: > In some cases I wish we had a server vs desktop switch, since it would > make decisions on this easier. I know you say that servers care about > latency, but not at all to the extent that desktops do. Most desktop > users would gladly give away the top of the performance for latency, > that's not true of most server users. Depends on what the server does, > of course. If most of the I/O on a system exhibits seeky tendencies, couldn't the schedulers notice that and use that as the hint for what to optimize? I mean, there's no switch better than the actual I/O behavior itself. ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 16:33 ` Ray Lee @ 2009-10-02 17:13 ` Jens Axboe [not found] ` <2c0942db0910020933l6d312c6ahae0e00619f598b39-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 17:13 UTC (permalink / raw) To: Ray Lee Cc: Linus Torvalds, Ingo Molnar, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel On Fri, Oct 02 2009, Ray Lee wrote: > On Fri, Oct 2, 2009 at 7:56 AM, Jens Axboe <jens.axboe@oracle.com> wrote: > > In some cases I wish we had a server vs desktop switch, since it would > > make decisions on this easier. I know you say that servers care about > > latency, but not at all to the extent that desktops do. Most desktop > > users would gladly give away the top of the performance for latency, > > that's not true of most server users. Depends on what the server does, > > of course. > > If most of the I/O on a system exhibits seeky tendencies, couldn't the > schedulers notice that and use that as the hint for what to optimize? > > I mean, there's no switch better than the actual I/O behavior itself. Heuristics like that have a tendency to fail. What's the cut-off point? Additionally, heuristics based on past process/system behaviour also have a tendency to be suboptimal, since things aren't static. We already look at seekiness of individual processes or groups. IIRC, as-iosched also keeps per-queue tracking. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
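[Editorial sketch, not part of the thread.] To make the "what's the cut-off point?" objection concrete, here is a minimal per-process seekiness tracker of the general kind the exchange above refers to. It is only loosely inspired by the idea of a decayed mean seek distance; the struct, the function names, the 7/8 decay factor, the 8192-sector threshold and the minimum sample count are all assumptions made for illustration, not CFQ's actual code or constants.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEEKY_THRESHOLD_SECTORS 8192u /* assumed cut-off, arbitrary by nature */
#define SEEK_MIN_SAMPLES        8u    /* assumed: do not classify before this */

/* Hypothetical per-process state. Zero-initialise before first use. */
struct seek_stats {
	bool have_last;    /* a previous request has been seen */
	uint64_t last_end; /* sector just past the previous request */
	uint64_t mean;     /* decayed mean seek distance, in sectors */
	unsigned samples;  /* number of distances folded into 'mean' */
};

/* Account one request of 'len' sectors starting at sector 'start'. */
static void seek_stats_update(struct seek_stats *s, uint64_t start, uint64_t len)
{
	assert(len > 0);
	if (s->have_last) {
		uint64_t dist = start > s->last_end ? start - s->last_end
						    : s->last_end - start;

		/* New samples get weight 1/8, history keeps weight 7/8. */
		s->mean = s->samples ? (7 * s->mean + dist) / 8 : dist;
		s->samples++;
	}
	s->last_end = start + len;
	s->have_last = true;
}

/* True once enough samples exist and the decayed mean exceeds the cut-off. */
static bool seek_stats_seeky(const struct seek_stats *s)
{
	return s->samples >= SEEK_MIN_SAMPLES &&
	       s->mean > SEEKY_THRESHOLD_SECTORS;
}
```

Both of Jens's concerns show up directly in the sketch. The threshold is a bare constant with no principled value, and the decayed mean describes what the process did recently, so a process that changes its access pattern is misclassified until the history drains.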
* Re: IO scheduler based IO controller V10
From: Linus Torvalds @ 2009-10-02 15:14 UTC
To: Jens Axboe
Cc: dhaval, dm-devel, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel, fchecconi, containers, Mike Galbraith, linux-kernel, akpm, righi.andrea

On Fri, 2 Oct 2009, Jens Axboe wrote:
>
> Mostly they care about throughput, and when they come running because
> some favorite app/benchmark/etc of theirs is now 2% slower, I get to
> hear about it all the time. So yes, latency is not ignored, but mostly
> they yack about throughput.

The reason they yack about it is that they can measure it. Give them the
benchmark where it goes the other way, and tell them why they see a 2%
deprovement. Give them some button they can tweak, because they will.

But make the default be low-latency. Because everybody cares about low
latency, and the people who do so are _not_ the people who you give
buttons to tweak things with.

> I agree, we can easily make CFQ be very much about latency. If you
> think that is fine, then let's just do that. Then we'll get to fix the
> server side up when the next RHEL/SLES/whatever cycle is honing in on a
> kernel, hopefully we won't have to start over when that happens.

I really think we should do latency first, and throughput second.

It's _easy_ to get throughput. The people who care just about throughput
can always just disable all the work we do for latency. If they really
care about just throughput, they won't want fairness either - none of
that complex stuff.

		Linus
* Re: IO scheduler based IO controller V10
From: Mike Galbraith @ 2009-10-02 14:45 UTC
To: Linus Torvalds
Cc: dhaval, dm-devel, Jens Axboe, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel, fchecconi, containers, linux-kernel, akpm, righi.andrea

On Fri, 2009-10-02 at 07:24 -0700, Linus Torvalds wrote:
>
> On Fri, 2 Oct 2009, Jens Axboe wrote:
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more.
>
> Well, if we're talking 500-950% improvement vs 30% deprovement, I think
> it's pretty clear, though. Even the server people do care about latencies.
>
> Often they care quite a bit, in fact.
>
> And Mike's patch didn't look big or complicated.

But it is a hack. (thought about and measured, but hack nonetheless)

I haven't tested it on much other than reader vs streaming writer. It
may well destroy the rest of the IO universe. I don't have the hw to
even test any hairy chested IO.

	-Mike
* Re: IO scheduler based IO controller V10
From: Jens Axboe @ 2009-10-02 14:56 UTC
To: Linus Torvalds
Cc: dhaval, dm-devel, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, Ingo Molnar, riel, fchecconi, containers, Mike Galbraith, linux-kernel, akpm, righi.andrea

On Fri, Oct 02 2009, Linus Torvalds wrote:
>
> On Fri, 2 Oct 2009, Jens Axboe wrote:
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more.
>
> Well, if we're talking 500-950% improvement vs 30% deprovement, I think
> it's pretty clear, though. Even the server people do care about latencies.
>
> Often they care quite a bit, in fact.

Mostly they care about throughput, and when they come running because
some favorite app/benchmark/etc of theirs is now 2% slower, I get to
hear about it all the time. So yes, latency is not ignored, but mostly
they yack about throughput.

> And Mike's patch didn't look big or complicated.

It wasn't, it was more of a hack than something mergeable though (and I
think Mike will agree on that). So I'll repeat what I said to Mike, I'm
very well prepared to get something worked out and merged and I very
much appreciate the work he's putting into this.

> > You can't say it's black and white latency vs throughput issue,
>
> Umm. Almost 1000% vs 30%. Forget latency vs throughput. That's pretty damn
> black-and-white _regardless_ of what you're measuring. Plus you probably
> made up the 30% - have you tested the patch?

The 30% is totally made up, it's based on previous latency vs throughput
tradeoffs. I haven't tested Mike's patch.

> And quite frankly, we get a _lot_ of complaints about latency. A LOT. It's
> just harder to measure, so people seldom attach numbers to it. But that
> again means that when people _are_ able to attach numbers to it, we should
> take those numbers _more_ seriously rather than less.

I agree, we can easily make CFQ be very much about latency. If you think
that is fine, then let's just do that. Then we'll get to fix the server
side up when the next RHEL/SLES/whatever cycle is honing in on a kernel,
hopefully we won't have to start over when that happens.

> So the 30% you threw out as a number is pretty much worthless.

It's hand waving, definitely. But I've been doing io scheduler tweaking
for years, and I know how hard it is to balance. If you want latency,
then you basically only ever give the device 1 thing to do. And you let
things cool down before switching over. If you do that, then your nice
big array of SSDs or rotating drives will easily drop to 1/4th of the
original performance. So we try and tweak the logic to make everybody
happy.

In some cases I wish we had a server vs desktop switch, since it would
make decisions on this easier. I know you say that servers care about
latency, but not at all to the extent that desktops do. Most desktop
users would gladly give away the top of the performance for latency,
that's not true of most server users. Depends on what the server does,
of course.

--
Jens Axboe
* Re: IO scheduler based IO controller V10
From: Ingo Molnar @ 2009-10-02 16:22 UTC
To: Linus Torvalds
Cc: Jens Axboe, Mike Galbraith, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, riel

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, 2 Oct 2009, Jens Axboe wrote:
> >
> > It's really not that simple, if we go and do easy latency bits, then
> > throughput drops 30% or more.
>
> Well, if we're talking 500-950% improvement vs 30% deprovement, I
> think it's pretty clear, though. Even the server people do care about
> latencies.
>
> Often they care quite a bit, in fact.

The other thing is that latency is basically a given property in any
system - as an app writer you have to live with it, there's not much you
can do to improve it.

Bandwidth on the other hand is a lot more engineerable, as it tends to
be about batching things and you can batch in user-space too. Batching
is often easier to do than getting good latencies.

Then there's also the fact that the range of apps that care about
bandwidth is a lot smaller than the range of apps which care about
latencies. The default should help more apps - i.e. latencies.

	Ingo
* Re: IO scheduler based IO controller V10
From: Jens Axboe @ 2009-10-02 9:28 UTC
To: Ingo Molnar
Cc: dhaval, dm-devel, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, riel, fchecconi, containers, Mike Galbraith, linux-kernel, akpm, righi.andrea, torvalds

On Fri, Oct 02 2009, Ingo Molnar wrote:
>
> * Jens Axboe <jens.axboe@oracle.com> wrote:
>
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
>
> Looking at the numbers from Mike:
>
> | dd competing against perf stat -- konsole -e exec timings, 5 back to
> | back runs
> |                                                          Avg
> | before          9.15    14.51     9.39    15.06     9.90   11.6
> | after [+patch]  1.76     1.54     1.93     1.88     1.56    1.7
>
> _PLEASE_ make read latencies this good - the numbers are _vastly_
> better. We'll worry about the 'other' things _after_ we've reached good
> latencies.
>
> I thought this principle was a well established basic rule of Linux IO
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion
> again and again? I thought latency won hands down.

It's really not that simple, if we go and do easy latency bits, then
throughput drops 30% or more. You can't say it's a black and white
latency vs throughput issue, that's just not how the real world works.
The server folks would be most displeased.

--
Jens Axboe
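As a quick check on the figures quoted above, the following sketch recomputes the two "Avg" columns from Mike's five runs and the ratio between them. Only the ten timings come from the thread; the array and function names are made up for this example.

```c
#include <stddef.h>

/* Timings from the quoted table, five back-to-back runs each. */
static const double before_patch[5] = { 9.15, 14.51, 9.39, 15.06, 9.90 };
static const double after_patch[5]  = { 1.76, 1.54, 1.93, 1.88, 1.56 };

/* Arithmetic mean of n samples. */
static double mean_of(const double *v, size_t n)
{
	double sum = 0.0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum / (double)n;
}

/* Ratio of the mean timing before the patch to the mean after it. */
static double speedup(void)
{
	return mean_of(before_patch, 5) / mean_of(after_patch, 5);
}
```

The means work out to 58.01 / 5 = 11.602 and 8.67 / 5 = 1.734, matching the rounded 11.6 and 1.7 in the table, for a ratio of roughly 6.7.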
* Re: IO scheduler based IO controller V10
From: Mike Galbraith @ 2009-10-02 9:36 UTC
To: Ingo Molnar
Cc: Jens Axboe, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

On Fri, 2009-10-02 at 11:24 +0200, Ingo Molnar wrote:
> * Jens Axboe <jens.axboe@oracle.com> wrote:
>
> > It's not hard to make the latency good, the hard bit is making sure we
> > also perform well for all other scenarios.
>
> Looking at the numbers from Mike:
>
> | dd competing against perf stat -- konsole -e exec timings, 5 back to
> | back runs
> |                                                          Avg
> | before          9.15    14.51     9.39    15.06     9.90   11.6
> | after [+patch]  1.76     1.54     1.93     1.88     1.56    1.7
>
> _PLEASE_ make read latencies this good - the numbers are _vastly_
> better. We'll worry about the 'other' things _after_ we've reached good
> latencies.
>
> I thought this principle was a well established basic rule of Linux IO
> scheduling. Why do we have to have a 'latency vs. bandwidth' discussion
> again and again? I thought latency won hands down.

Just a note: In the testing I've done so far, we're better off today
than ever, and I can't recall beating on root ever being anything less
than agony for interactivity. IO seekers look a lot like CPU sleepers to
me. Looks like both can be as annoying as hell ;-)

	-Mike
* Re: IO scheduler based IO controller V10
From: Ingo Molnar @ 2009-10-02 16:37 UTC
To: Mike Galbraith
Cc: Jens Axboe, Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, riel

* Mike Galbraith <efault@gmx.de> wrote:

> On Fri, 2009-10-02 at 11:24 +0200, Ingo Molnar wrote:
> > * Jens Axboe <jens.axboe@oracle.com> wrote:
> >
> > > It's not hard to make the latency good, the hard bit is making sure we
> > > also perform well for all other scenarios.
> >
> > Looking at the numbers from Mike:
> >
> > | dd competing against perf stat -- konsole -e exec timings, 5 back to
> > | back runs
> > |                                                          Avg
> > | before          9.15    14.51     9.39    15.06     9.90   11.6
> > | after [+patch]  1.76     1.54     1.93     1.88     1.56    1.7
> >
> > _PLEASE_ make read latencies this good - the numbers are _vastly_
> > better. We'll worry about the 'other' things _after_ we've reached good
> > latencies.
> >
> > I thought this principle was a well established basic rule of Linux
> > IO scheduling. Why do we have to have a 'latency vs. bandwidth'
> > discussion again and again? I thought latency won hands down.
>
> Just a note: In the testing I've done so far, we're better off today
> than ever, [...]

Definitely so, and a couple of months ago I sang the praises of that
progress on the IO/fs latencies front:

   http://lkml.org/lkml/2009/4/9/461

... but we are greedy bastards and don't define excellence by how far
we have come but by how high we can still climb ;-)

	Ingo
* Re: IO scheduler based IO controller V10
From: Mike Galbraith @ 2009-10-02 6:23 UTC
To: Jens Axboe
Cc: dhaval, dm-devel, agk, balbir, paolo.valente, jmarchan, fernando, Ulrich Lukas, jmoyer, mingo, riel, fchecconi, containers, linux-kernel, akpm, righi.andrea, torvalds

On Thu, 2009-10-01 at 20:58 +0200, Jens Axboe wrote:
> On Thu, Oct 01 2009, Mike Galbraith wrote:
> > > CIC_SEEK_THR is 8K jiffies so that would be 8seconds on 1000HZ system. Try
> > > using one "slice_idle" period of 8 ms. But it might turn out to be too
> > > short depending on the disk speed.
> >
> > Yeah, it is too short, as is even _400_ ms. Trouble is, by the time
> > some new task is determined to be seeky, the damage is already done.
> >
> > The below does better, though not as well as "just say no to overload"
> > of course ;-)
>
> So this essentially takes the "avoid impact from previous slice" to a
> new extreme, by idling even before dispatching requests from the new
> queue. We basically do two things to prevent this already - one is to
> only set the slice when the first request is actually serviced, and the
> other is to drain async requests completely before starting sync ones.
> I'm a bit surprised that the former doesn't solve the problem fully, I
> guess what happens is that if the drive has been flooded with writes, it
> may service the new read immediately and then return to finish emptying
> its writeback cache. This will cause an impact for any sync IO until
> that cache is flushed, and then cause that sync queue to not get as much
> service as it should have.

I did the stamping selection other than how long have we been solo based
on these possibly wrong speculations:

If we're in the idle window and doing the async drain thing, we're at
the spot where Vivek's patch helps a ton. Seemed like a great time to
limit the size of any io that may land in front of my sync reader to
plain "you are not alone" quantity.

If we've got sync io in flight, that should mean that my new or old
known seeky queue has been serviced at least once. There's likely to be
more on the way, so delay overloading then too.

The seeky bit is supposed to be the earlier "last time we saw a seeker"
thing, but known seeky is too late to help a new task at all unless you
turn off the overloading for ages, so I added the if incalculable check
for good measure, hoping that meant the task is new, may want to exec.

Stamping any place may (see below) possibly limit the size of the io the
reader can generate as well as writer, but I figured what's good for the
goose is good for the gander, or it ain't really good. The overload
was causing the observed pain, definitely ain't good for both at these
times at least, so don't let it do that.

> Perhaps the "set slice on first complete" isn't working correctly? Or
> perhaps we just need to be more extreme.

Dunno, I was just tossing rocks and sticks at it.

I don't really understand the reasoning behind overloading: I can see
that allows cutting thicker slabs for the disk, but with the streaming
writer vs reader case, seems only the writers can do that. The reader is
unlikely to be alone, isn't it? Seems to me that either dd, a flusher
thread or kjournald is going to be there with it, which gives dd a huge
advantage.. it has two proxies to help it squabble over disk, konsole
has none.

	-Mike
* Re: IO scheduler based IO controller V10 2009-10-01 7:33 ` Mike Galbraith [not found] ` <1254382405.7595.9.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> @ 2009-10-02 18:08 ` Jens Axboe 2009-10-02 18:29 ` Mike Galbraith [not found] ` <20091002180857.GM31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 1 sibling, 2 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 18:08 UTC (permalink / raw) To: Mike Galbraith Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel On Thu, Oct 01 2009, Mike Galbraith wrote: > max_dispatch = cfqd->cfq_quantum; > if (cfq_class_idle(cfqq)) > max_dispatch = 1; > > + if (cfqd->busy_queues > 1) > + cfqd->od_stamp = jiffies; > + ->busy_queues > 1 just means that they have requests ready for dispatch, not that they are dispatched. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 18:08 ` Jens Axboe @ 2009-10-02 18:29 ` Mike Galbraith 2009-10-02 18:36 ` Jens Axboe [not found] ` <1254508197.8667.22.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> [not found] ` <20091002180857.GM31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 1 sibling, 2 replies; 349+ messages in thread From: Mike Galbraith @ 2009-10-02 18:29 UTC (permalink / raw) To: Jens Axboe Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel On Fri, 2009-10-02 at 20:08 +0200, Jens Axboe wrote: > On Thu, Oct 01 2009, Mike Galbraith wrote: > > max_dispatch = cfqd->cfq_quantum; > > if (cfq_class_idle(cfqq)) > > max_dispatch = 1; > > > > + if (cfqd->busy_queues > 1) > > + cfqd->od_stamp = jiffies; > > + > > ->busy_queues > 1 just means that they have requests ready for dispatch, > not that they are dispatched. But we're not alone, somebody else is using disk. I'm trying to make sure we don't have someone _about_ to come back.. like a reader, so when there's another player, stamp to give him some time to wake up/submit before putting the pedal to the metal. -Mike ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-02 18:29 ` Mike Galbraith @ 2009-10-02 18:36 ` Jens Axboe [not found] ` <1254508197.8667.22.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 1 sibling, 0 replies; 349+ messages in thread From: Jens Axboe @ 2009-10-02 18:36 UTC (permalink / raw) To: Mike Galbraith Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel On Fri, Oct 02 2009, Mike Galbraith wrote: > On Fri, 2009-10-02 at 20:08 +0200, Jens Axboe wrote: > > On Thu, Oct 01 2009, Mike Galbraith wrote: > > > max_dispatch = cfqd->cfq_quantum; > > > if (cfq_class_idle(cfqq)) > > > max_dispatch = 1; > > > > > > + if (cfqd->busy_queues > 1) > > > + cfqd->od_stamp = jiffies; > > > + > > > > ->busy_queues > 1 just means that they have requests ready for dispatch, > > not that they are dispatched. > > But we're not alone, somebody else is using disk. I'm trying to make > sure we don't have someone _about_ to come back.. like a reader, so when > there's another player, stamp to give him some time to wake up/submit > before putting the pedal to the metal. OK, then the check does what you want. It'll tell you that you have a pending request, and at least one other queue has one too. And that could dispatch right after you finish yours, depending on idling etc. Note that this _only_ applies to queues that have requests still sitting in CFQ, as soon as they are on the dispatch list in the block layer they will only be counted as busy if they still have sorted IO waiting. But that should be OK already, since I switched CFQ to dispatch single requests a few revisions ago. So we should not run into that anymore. -- Jens Axboe ^ permalink raw reply [flat|nested] 349+ messages in thread
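The stamping idea discussed in this exchange can be modelled in user space. What follows is a minimal sketch, not the posted patch: only the `busy_queues > 1` test and the `od_stamp` field come from the quoted hunk. The `model_` names, the toy struct and the `delay` parameter are assumptions made for illustration, and the gate that consumes the stamp is a guess at the intent ("give him some time to wake up/submit").

```c
#include <assert.h>

/* Toy stand-in for the two cfq_data fields under discussion (hypothetical). */
struct model_cfqd {
	unsigned int busy_queues;  /* queues with requests still held in CFQ */
	unsigned long od_stamp;    /* last time another queue was seen busy */
};

/* Mirrors the quoted hunk: refresh the stamp whenever we are not alone. */
static void model_note_dispatch(struct model_cfqd *d, unsigned long now)
{
	assert(d);
	if (d->busy_queues > 1)
		d->od_stamp = now;
}

/*
 * Hypothetical gate: allow deeper dispatch only once `delay` ticks have
 * passed since the last stamp. The signed subtraction keeps the comparison
 * correct across counter wraparound, as the kernel's time_after_eq() does.
 */
static int model_may_overload(const struct model_cfqd *d,
			      unsigned long now, unsigned long delay)
{
	return (long)(now - (d->od_stamp + delay)) >= 0;
}
```

As Jens points out, `busy_queues` counts queues that still have requests sitting in CFQ, so the stamp records "somebody else has pending work", not "somebody else has been dispatched".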
[parent not found: <1254341139.7695.36.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <1254341139.7695.36.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> @ 2009-09-30 20:24 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-30 20:24 UTC (permalink / raw) To: Mike Galbraith Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, Jens Axboe, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Wed, Sep 30, 2009 at 10:05:39PM +0200, Mike Galbraith wrote: > > > > /* > > + * We may have seeky queues, don't throttle up just yet. > > + */ > > + if (time_before(jiffies, cfqd->last_seeker + CIC_SEEK_THR)) > > + return 0; > > + > > bzzzt. Window too large, but the though is to let them overload, but > not instantly. > CIC_SEEK_THR is 8K jiffies, so that would be about 8 seconds on a system with HZ=1000. Try using one "slice_idle" period of 8 ms instead. But it might turn out to be too short, depending on the disk speed. Thanks Vivek ^ permalink raw reply [flat|nested] 349+ messages in thread
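The arithmetic behind "8K jiffies is 8 seconds" and "slice_idle is 8 ms" can be checked with a small user-space model. This is a sketch under stated assumptions: `CIC_SEEK_THR` is taken as 8 * 1024 because the message calls it "8K jiffies", the tick rate is passed in explicitly instead of using the kernel's `HZ`, and every `model_` name is invented here. The comparison helper imitates the wraparound-safe form of the kernel's `time_before()`.

```c
#include <assert.h>

/* Assumed value: the message above describes it as "8K jiffies". */
#define CIC_SEEK_THR (8 * 1024)

/* User-space imitation of time_before(a, b): true if a is earlier than b. */
static int model_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Convert a tick count to milliseconds for a given tick rate (ticks/s). */
static unsigned long model_jiffies_to_ms(unsigned long ticks, unsigned long hz)
{
	assert(hz > 0);
	return ticks * 1000UL / hz;
}

/* Non-zero while `now` is still inside `window` ticks of the last seeker. */
static int model_in_seeker_window(unsigned long now,
				  unsigned long last_seeker,
				  unsigned long window)
{
	return model_time_before(now, last_seeker + window);
}
```

With a tick rate of 1000, `model_jiffies_to_ms(CIC_SEEK_THR, 1000)` gives 8192 ms, roughly the 8 seconds Vivek mentions, while a window of `1000 / 125` ticks is 8 ms, one slice_idle period.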
* Re: IO scheduler based IO controller V10 2009-09-25 20:26 ` Vivek Goyal (?) (?) @ 2009-09-27 17:00 ` Corrado Zoccolo 2009-09-28 14:56 ` Vivek Goyal [not found] ` <4e5e476b0909271000u69d79346s27cccad219e49902-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> -1 siblings, 2 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-09-27 17:00 UTC (permalink / raw) To: Vivek Goyal Cc: Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe Hi Vivek, On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote: >> Vivek Goyal wrote: >> > Notes: >> > - With vanilla CFQ, random writers can overwhelm a random reader. >> > Bring down its throughput and bump up latencies significantly. >> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers, >> too. >> >> I'm basing this assumption on the observations I made on both OpenSuse >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML >> titled: "Poor desktop responsiveness with background I/O-operations" of >> 2009-09-20. >> (Message ID: 4AB59CBB.8090907@datenparkplatz.de) >> >> >> Thus, I'm posting this to show that your work is greatly appreciated, >> given the rather disappointig status quo of Linux's fairness when it >> comes to disk IO time. >> >> I hope that your efforts lead to a change in performance of current >> userland applications, the sooner, the better. >> > [Please don't remove people from original CC list. I am putting them back.] > > Hi Ulrich, > > I quicky went through that mail thread and I tried following on my > desktop. > > ########################################## > dd if=/home/vgoyal/4G-file of=/dev/null & > sleep 5 > time firefox > # close firefox once gui pops up. 
> ########################################## > > It was taking close to 1 minute 30 seconds to launch firefox and dd got > following. > > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s > > (Results do vary across runs, especially if system is booted fresh. Don't > know why...). > > > Then I tried putting both the applications in separate groups and assign > them weights 200 each. > > ########################################## > dd if=/home/vgoyal/4G-file of=/dev/null & > echo $! > /cgroup/io/test1/tasks > sleep 5 > echo $$ > /cgroup/io/test2/tasks > time firefox > # close firefox once gui pops up. > ########################################## > > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3. > > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s > > Notice that throughput of dd also improved. > > I ran the block trace and noticed in many a cases firefox threads > immediately preempted the "dd". Probably because it was a file system > request. So in this case latency will arise from seek time. > > In some other cases, threads had to wait for up to 100ms because dd was > not preempted. In this case latency will arise both from waiting on queue > as well as seek time. I think cfq should already be doing something similar, i.e. giving 100ms slices to firefox, that alternate with dd, unless: * firefox is too seeky (in this case, the idle window will be too small) * firefox has too much think time. To rule out the first case, what happens if you run the test with your "fairness for seeky processes" patch? To rule out the second case, what happens if you increase the slice_idle? Thanks, Corrado > > With cgroup thing, We will run 100ms slice for the group in which firefox > is being launched and then give 100ms uninterrupted time slice to dd. So > it should cut down on number of seeks happening and that's why we probably > see this improvement. > > So grouping can help in such cases. 
May be you can move your X session in > one group and launch the big IO in other group. Most likely you should > have better desktop experience without compromising on dd thread output. > Thanks > Vivek > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-09-27 17:00 ` Corrado Zoccolo @ 2009-09-28 14:56 ` Vivek Goyal [not found] ` <4e5e476b0909271000u69d79346s27cccad219e49902-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-28 14:56 UTC (permalink / raw) To: Corrado Zoccolo Cc: Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote: > Hi Vivek, > On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote: > >> Vivek Goyal wrote: > >> > Notes: > >> > - With vanilla CFQ, random writers can overwhelm a random reader. > >> > Bring down its throughput and bump up latencies significantly. > >> > >> > >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers, > >> too. > >> > >> I'm basing this assumption on the observations I made on both OpenSuse > >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML > >> titled: "Poor desktop responsiveness with background I/O-operations" of > >> 2009-09-20. > >> (Message ID: 4AB59CBB.8090907@datenparkplatz.de) > >> > >> > >> Thus, I'm posting this to show that your work is greatly appreciated, > >> given the rather disappointig status quo of Linux's fairness when it > >> comes to disk IO time. > >> > >> I hope that your efforts lead to a change in performance of current > >> userland applications, the sooner, the better. > >> > > [Please don't remove people from original CC list. I am putting them back.] > > > > Hi Ulrich, > > > > I quicky went through that mail thread and I tried following on my > > desktop. 
> > > > ########################################## > > dd if=/home/vgoyal/4G-file of=/dev/null & > > sleep 5 > > time firefox > > # close firefox once gui pops up. > > ########################################## > > > > It was taking close to 1 minute 30 seconds to launch firefox and dd got > > following. > > > > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s > > > > (Results do vary across runs, especially if system is booted fresh. Don't > > know why...). > > > > > > Then I tried putting both the applications in separate groups and assign > > them weights 200 each. > > > > ########################################## > > dd if=/home/vgoyal/4G-file of=/dev/null & > > echo $! > /cgroup/io/test1/tasks > > sleep 5 > > echo $$ > /cgroup/io/test2/tasks > > time firefox > > # close firefox once gui pops up. > > ########################################## > > > > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3. > > > > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s > > > > Notice that throughput of dd also improved. > > > > I ran the block trace and noticed in many a cases firefox threads > > immediately preempted the "dd". Probably because it was a file system > > request. So in this case latency will arise from seek time. > > > > In some other cases, threads had to wait for up to 100ms because dd was > > not preempted. In this case latency will arise both from waiting on queue > > as well as seek time. > > I think cfq should already be doing something similar, i.e. giving > 100ms slices to firefox, that alternate with dd, unless: > * firefox is too seeky (in this case, the idle window will be too small) > * firefox has too much think time. > Hi Corrado, "firefox" is the shell script that sets up the environment and launches the browser. It seems to be a group of threads. Some of them run in parallel and some of them seem to run one after the other (once the previous process or thread has finished).
> To rule out the first case, what happens if you run the test with your > "fairness for seeky processes" patch? I applied that patch and it helps a lot. http://lwn.net/Articles/341032/ With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds. So it looks like if we don't disable idle window for seeky processes on hardware supporting command queuing, it helps in this particular case. Thanks Vivek > To rule out the second case, what happens if you increase the slice_idle? > > Thanks, > Corrado > > > > > With cgroup thing, We will run 100ms slice for the group in which firefox > > is being launched and then give 100ms uninterrupted time slice to dd. So > > it should cut down on number of seeks happening and that's why we probably > > see this improvement. > > > > So grouping can help in such cases. May be you can move your X session in > > one group and launch the big IO in other group. Most likely you should > > have better desktop experience without compromising on dd thread output. > > > Thanks > > Vivek > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > > > > -- > __________________________________________________________________________ > > dott. Corrado Zoccolo mailto:czoccolo@gmail.com > PhD - Department of Computer Science - University of Pisa, Italy > -------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 349+ messages in thread
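The two cases Corrado wants to rule out (too seeky, too much think time) and the effect of fairness=1 can be put into one small decision function. This is a user-space sketch, assuming the idle-window condition quoted earlier in the thread; the struct, its field names and `model_enable_idle` are hypothetical stand-ins, not kernel definitions, and the think-time comparison against slice_idle is an assumption about the stock behaviour.

```c
#include <assert.h>

/* Hypothetical flattened view of the state the idle-window check looks at. */
struct model_idle_ctx {
	int nr_tasks;              /* tasks still attached to the io context */
	unsigned long slice_idle;  /* idle window length; 0 disables idling */
	int hw_tag;                /* device appears to do command queuing */
	int seeky;                 /* queue classified as seeky */
	int fairness;              /* the fairness=1 tunable from the patchset */
	int samples_valid;         /* enough think-time samples collected */
	unsigned long ttime_mean;  /* mean think time, same unit as slice_idle */
};

/* Returns 1 if the queue should keep an idle window, 0 otherwise. */
static int model_enable_idle(const struct model_idle_ctx *c, int old_idle)
{
	assert(c);
	/* With fairness set, "seeky on a queuing device" no longer disables idling. */
	if (!c->nr_tasks || !c->slice_idle ||
	    (!c->fairness && c->hw_tag && c->seeky))
		return 0;
	/* Too much think time: waiting for the next request would waste the disk. */
	if (c->samples_valid)
		return c->ttime_mean <= c->slice_idle;
	return old_idle;
}
```

For a seeky queue on an NCQ disk, the function returns 0 with `fairness = 0` and keeps the previous setting with `fairness = 1`, which matches the improvement Vivek reports for the firefox launch.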
[parent not found: <20090928145655.GB8192-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20090928145655.GB8192-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-09-28 15:35 ` Corrado Zoccolo 0 siblings, 0 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-09-28 15:35 UTC (permalink / raw) To: Vivek Goyal Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b [-- Attachment #1: Type: text/plain, Size: 5325 bytes --] On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote: >> Hi Vivek, >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote: >> >> Vivek Goyal wrote: >> >> > Notes: >> >> > - With vanilla CFQ, random writers can overwhelm a random reader. >> >> > Bring down its throughput and bump up latencies significantly. >> >> >> >> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers, >> >> too. >> >> >> >> I'm basing this assumption on the observations I made on both OpenSuse >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML >> >> titled: "Poor desktop responsiveness with background I/O-operations" of >> >> 2009-09-20. 
>> >> (Message ID: 4AB59CBB.8090907-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org) >> >> >> >> >> >> Thus, I'm posting this to show that your work is greatly appreciated, >> >> given the rather disappointig status quo of Linux's fairness when it >> >> comes to disk IO time. >> >> >> >> I hope that your efforts lead to a change in performance of current >> >> userland applications, the sooner, the better. >> >> >> > [Please don't remove people from original CC list. I am putting them back.] >> > >> > Hi Ulrich, >> > >> > I quicky went through that mail thread and I tried following on my >> > desktop. >> > >> > ########################################## >> > dd if=/home/vgoyal/4G-file of=/dev/null & >> > sleep 5 >> > time firefox >> > # close firefox once gui pops up. >> > ########################################## >> > >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got >> > following. >> > >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s >> > >> > (Results do vary across runs, especially if system is booted fresh. Don't >> > know why...). >> > >> > >> > Then I tried putting both the applications in separate groups and assign >> > them weights 200 each. >> > >> > ########################################## >> > dd if=/home/vgoyal/4G-file of=/dev/null & >> > echo $! > /cgroup/io/test1/tasks >> > sleep 5 >> > echo $$ > /cgroup/io/test2/tasks >> > time firefox >> > # close firefox once gui pops up. >> > ########################################## >> > >> > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3. >> > >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s >> > >> > Notice that throughput of dd also improved. >> > >> > I ran the block trace and noticed in many a cases firefox threads >> > immediately preempted the "dd". Probably because it was a file system >> > request. So in this case latency will arise from seek time. 
>> > >> > In some other cases, threads had to wait for up to 100ms because dd was >> > not preempted. In this case latency will arise both from waiting on queue >> > as well as seek time. >> >> I think cfq should already be doing something similar, i.e. giving >> 100ms slices to firefox, that alternate with dd, unless: >> * firefox is too seeky (in this case, the idle window will be too small) >> * firefox has too much think time. >> > Hi Vivek, > Hi Corrado, > > "firefox" is the shell script to setup the environment and launch the > broser. It seems to be a group of threads. Some of them run in parallel > and some of these seems to be running one after the other (once previous > process or threads finished). Ok. > >> To rule out the first case, what happens if you run the test with your >> "fairness for seeky processes" patch? > > I applied that patch and it helps a lot. > > http://lwn.net/Articles/341032/ > > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds. Great. Can you try the attached patch (on top of 2.6.31)? It implements the alternative approach we discussed privately in July, and it addresses the possible latency increase that could happen with your patch. To summarize for everyone, we separate sync sequential queues, sync seeky queues and async queues into three separate RR structures, and alternate servicing requests between them. When servicing seeky queues (the ones that are usually penalized by cfq, for which no fairness is usually provided), we do not idle between them, but we do idle for the last queue (the idle can be exited when any seeky queue has requests). This allows us to allocate disk time globally for all seeky processes, and to reduce the latency of seeky processes. I tested with 'konsole -e exit' while doing a sequential write with dd, and the startup time dropped from 37s to 7s on an old laptop disk.
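The scheme summarized above can be illustrated with a toy model. The following is a minimal sketch, not the attached patch: the enum values mirror the `wl_type_t` definition in the diff, but the plain rotation order, the `model_` function names and the "queues left" argument are assumptions, and the patch's async penalty and target latency are left out.

```c
#include <assert.h>

/* Same three workload types as the enum in the attached diff. */
enum wl_type_t {
	ASYNC_WL = 0,
	SYNC_NOIDLE_WL = 1,  /* sync seeky queues */
	SYNC_WL = 2          /* sync sequential queues */
};

#define WL_TYPES 3

/*
 * Pick the next workload after `cur` that has at least one queue, rotating
 * through the three types. Returns -1 when every tree is empty.
 */
static int model_next_workload(int cur, const unsigned int count[WL_TYPES])
{
	int i;

	assert(cur >= 0 && cur < WL_TYPES);
	for (i = 1; i <= WL_TYPES; i++) {
		int cand = (cur + i) % WL_TYPES;

		if (count[cand] > 0)
			return cand;
	}
	return -1;
}

/*
 * Idle policy from the summary: always idle for sequential sync queues,
 * idle for seeky queues only on the last one in the tree, never for async.
 * `queues_left` counts the queues still to be served, current one included.
 */
static int model_should_idle(enum wl_type_t type, unsigned int queues_left)
{
	switch (type) {
	case SYNC_WL:
		return 1;
	case SYNC_NOIDLE_WL:
		return queues_left == 1;
	default:
		return 0;
	}
}
```

Idling only after the last seeky queue is what lets the seeky group as a whole hold on to its share of disk time without paying one idle window per queue.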
Thanks, Corrado > >> To rule out the first case, what happens if you run the test with your >> "fairness for seeky processes" patch? > > I applied that patch and it helps a lot. > > http://lwn.net/Articles/341032/ > > With above patchset applied, and fairness=1, firefox pops up in 27-28 > seconds. > > So it looks like if we don't disable idle window for seeky processes on > hardware supporting command queuing, it helps in this particular case. > > Thanks > Vivek > [-- Attachment #2: cfq.patch --] [-- Type: application/octet-stream, Size: 24221 bytes --] diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index fd7080e..064f4fb 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -27,6 +27,12 @@ static const int cfq_slice_sync = HZ / 10; static int cfq_slice_async = HZ / 25; static const int cfq_slice_async_rq = 2; static int cfq_slice_idle = HZ / 125; +static int cfq_target_latency = HZ * 3/10; /* 300 ms */ +static int cfq_hist_divisor = 4; +/* + * Number of times that other workloads can be scheduled before async + */ +static const unsigned int cfq_async_penalty = 4; /* * offset from end of service tree @@ -36,7 +42,7 @@ static int cfq_slice_idle = HZ / 125; /* * below this threshold, we consider thinktime immediate */ -#define CFQ_MIN_TT (2) +#define CFQ_MIN_TT (1) #define CFQ_SLICE_SCALE (5) #define CFQ_HW_QUEUE_MIN (5) @@ -67,8 +73,9 @@ static DEFINE_SPINLOCK(ioc_gone_lock); struct cfq_rb_root { struct rb_root rb; struct rb_node *left; + unsigned count; }; -#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, } +#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, } /* * Per process-grouping structure @@ -113,6 +120,21 @@ struct cfq_queue { unsigned short ioprio_class, org_ioprio_class; pid_t pid; + + struct cfq_rb_root *service_tree; + struct cfq_io_context *cic; +}; + +enum wl_prio_t { + IDLE_WL = -1, + BE_WL = 0, + RT_WL = 1 +}; + +enum wl_type_t { + ASYNC_WL = 0, + SYNC_NOIDLE_WL = 1, + SYNC_WL = 2 }; /* @@ -124,7 +146,13 @@ struct 
cfq_data { /* * rr list of queues with requests and the count of them */ - struct cfq_rb_root service_tree; + struct cfq_rb_root service_trees[2][3]; + struct cfq_rb_root service_tree_idle; + + enum wl_prio_t serving_prio; + enum wl_type_t serving_type; + unsigned long workload_expires; + unsigned int async_starved; /* * Each priority tree is sorted by next_request position. These @@ -134,14 +162,11 @@ struct cfq_data { struct rb_root prio_trees[CFQ_PRIO_LISTS]; unsigned int busy_queues; - /* - * Used to track any pending rt requests so we can pre-empt current - * non-RT cfqq in service when this value is non-zero. - */ - unsigned int busy_rt_queues; + unsigned int busy_queues_avg[2]; - int rq_in_driver; + int rq_in_driver[2]; int sync_flight; + int reads_delayed; /* * queue-depth detection @@ -178,6 +203,9 @@ struct cfq_data { unsigned int cfq_slice[2]; unsigned int cfq_slice_async_rq; unsigned int cfq_slice_idle; + unsigned int cfq_target_latency; + unsigned int cfq_hist_divisor; + unsigned int cfq_async_penalty; struct list_head cic_list; @@ -187,11 +215,15 @@ struct cfq_data { struct cfq_queue oom_cfqq; }; +static struct cfq_rb_root * service_tree_for(enum wl_prio_t prio, enum wl_type_t type, + struct cfq_data *cfqd) { + return prio == IDLE_WL ? 
&cfqd->service_tree_idle : &cfqd->service_trees[prio][type]; +} + enum cfqq_state_flags { CFQ_CFQQ_FLAG_on_rr = 0, /* on round-robin busy list */ CFQ_CFQQ_FLAG_wait_request, /* waiting for a request */ CFQ_CFQQ_FLAG_must_dispatch, /* must be allowed a dispatch */ - CFQ_CFQQ_FLAG_must_alloc, /* must be allowed rq alloc */ CFQ_CFQQ_FLAG_must_alloc_slice, /* per-slice must_alloc flag */ CFQ_CFQQ_FLAG_fifo_expire, /* FIFO checked in this slice */ CFQ_CFQQ_FLAG_idle_window, /* slice idling enabled */ @@ -218,7 +250,6 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq) \ CFQ_CFQQ_FNS(on_rr); CFQ_CFQQ_FNS(wait_request); CFQ_CFQQ_FNS(must_dispatch); -CFQ_CFQQ_FNS(must_alloc); CFQ_CFQQ_FNS(must_alloc_slice); CFQ_CFQQ_FNS(fifo_expire); CFQ_CFQQ_FNS(idle_window); @@ -233,12 +264,28 @@ CFQ_CFQQ_FNS(coop); #define cfq_log(cfqd, fmt, args...) \ blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args) +#define CIC_SEEK_THR 1024 +#define CIC_SEEKY(cic) ((cic)->seek_mean > CIC_SEEK_THR) +#define CFQQ_SEEKY(cfqq) (!cfqq->cic || CIC_SEEKY(cfqq->cic)) + +static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd) { + return wl==IDLE_WL? 
cfqd->service_tree_idle.count : + cfqd->service_trees[wl][ASYNC_WL].count + + cfqd->service_trees[wl][SYNC_NOIDLE_WL].count + + cfqd->service_trees[wl][SYNC_WL].count; +} + static void cfq_dispatch_insert(struct request_queue *, struct request *); static struct cfq_queue *cfq_get_queue(struct cfq_data *, int, struct io_context *, gfp_t); static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *, struct io_context *); +static inline int rq_in_driver(struct cfq_data *cfqd) +{ + return cfqd->rq_in_driver[0] + cfqd->rq_in_driver[1]; +} + static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic, int is_sync) { @@ -249,6 +296,7 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic, struct cfq_queue *cfqq, int is_sync) { cic->cfqq[!!is_sync] = cfqq; + cfqq->cic = cic; } /* @@ -257,7 +305,7 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic, */ static inline int cfq_bio_sync(struct bio *bio) { - if (bio_data_dir(bio) == READ || bio_sync(bio)) + if (bio_data_dir(bio) == READ || bio_rw_flagged(bio, BIO_RW_SYNCIO)) return 1; return 0; @@ -303,10 +351,33 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq) return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio); } +static inline unsigned +cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) { + unsigned min_q, max_q; + unsigned mult = cfqd->cfq_hist_divisor - 1; + unsigned round = cfqd->cfq_hist_divisor / 2; + unsigned busy = cfq_busy_queues_wl(rt, cfqd); + min_q = min(cfqd->busy_queues_avg[rt], busy); + max_q = max(cfqd->busy_queues_avg[rt], busy); + cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) / + cfqd->cfq_hist_divisor; + return cfqd->busy_queues_avg[rt]; +} + static inline void cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq) { - cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies; + unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1]; + unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq)); + 
unsigned slice = cfq_prio_to_slice(cfqd, cfqq); + + if (iq > process_thr) { + unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle + / cfqd->cfq_slice[1]; + slice = max(slice * process_thr / iq, min(slice, low_slice)); + } + + cfqq->slice_end = jiffies + slice; cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies); } @@ -445,6 +516,7 @@ static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root) if (root->left == n) root->left = NULL; rb_erase_init(n, &root->rb); + --root->count; } /* @@ -485,46 +557,56 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd, } /* - * The cfqd->service_tree holds all pending cfq_queue's that have + * The cfqd->service_trees holds all pending cfq_queue's that have * requests waiting to be processed. It is sorted in the order that * we will service the queues. */ -static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq, - int add_front) +static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq) { struct rb_node **p, *parent; struct cfq_queue *__cfqq; unsigned long rb_key; + struct cfq_rb_root *service_tree; int left; if (cfq_class_idle(cfqq)) { rb_key = CFQ_IDLE_DELAY; - parent = rb_last(&cfqd->service_tree.rb); + service_tree = &cfqd->service_tree_idle; + parent = rb_last(&service_tree->rb); if (parent && parent != &cfqq->rb_node) { __cfqq = rb_entry(parent, struct cfq_queue, rb_node); rb_key += __cfqq->rb_key; } else rb_key += jiffies; - } else if (!add_front) { + } else { + enum wl_prio_t prio = cfq_class_rt(cfqq) ? RT_WL : BE_WL; + enum wl_type_t type = cfq_cfqq_sync(cfqq) ? 
SYNC_WL : ASYNC_WL; + rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; rb_key += cfqq->slice_resid; cfqq->slice_resid = 0; - } else - rb_key = 0; + + if (type == SYNC_WL && (CFQQ_SEEKY(cfqq) || !cfq_cfqq_idle_window(cfqq))) + type = SYNC_NOIDLE_WL; + + service_tree = service_tree_for(prio, type, cfqd); + } if (!RB_EMPTY_NODE(&cfqq->rb_node)) { /* * same position, nothing more to do */ - if (rb_key == cfqq->rb_key) + if (rb_key == cfqq->rb_key && cfqq->service_tree == service_tree) return; - cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree); + cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree); + cfqq->service_tree = NULL; } left = 1; parent = NULL; - p = &cfqd->service_tree.rb.rb_node; + cfqq->service_tree = service_tree; + p = &service_tree->rb.rb_node; while (*p) { struct rb_node **n; @@ -556,11 +638,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq, } if (left) - cfqd->service_tree.left = &cfqq->rb_node; + service_tree->left = &cfqq->rb_node; cfqq->rb_key = rb_key; rb_link_node(&cfqq->rb_node, parent, p); - rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb); + rb_insert_color(&cfqq->rb_node, &service_tree->rb); + service_tree->count++; } static struct cfq_queue * @@ -633,7 +716,7 @@ static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq) * Resorting requires the cfqq to be on the RR list already. 
*/ if (cfq_cfqq_on_rr(cfqq)) { - cfq_service_tree_add(cfqd, cfqq, 0); + cfq_service_tree_add(cfqd, cfqq); cfq_prio_tree_add(cfqd, cfqq); } } @@ -648,8 +731,6 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq) BUG_ON(cfq_cfqq_on_rr(cfqq)); cfq_mark_cfqq_on_rr(cfqq); cfqd->busy_queues++; - if (cfq_class_rt(cfqq)) - cfqd->busy_rt_queues++; cfq_resort_rr_list(cfqd, cfqq); } @@ -664,8 +745,10 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq) BUG_ON(!cfq_cfqq_on_rr(cfqq)); cfq_clear_cfqq_on_rr(cfqq); - if (!RB_EMPTY_NODE(&cfqq->rb_node)) - cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree); + if (!RB_EMPTY_NODE(&cfqq->rb_node)) { + cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree); + cfqq->service_tree = NULL; + } if (cfqq->p_root) { rb_erase(&cfqq->p_node, cfqq->p_root); cfqq->p_root = NULL; @@ -673,8 +756,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq) BUG_ON(!cfqd->busy_queues); cfqd->busy_queues--; - if (cfq_class_rt(cfqq)) - cfqd->busy_rt_queues--; } /* @@ -760,9 +841,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq) { struct cfq_data *cfqd = q->elevator->elevator_data; - cfqd->rq_in_driver++; + cfqd->rq_in_driver[rq_is_sync(rq)]++; cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d", - cfqd->rq_in_driver); + rq_in_driver(cfqd)); cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq); } @@ -770,11 +851,12 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq) static void cfq_deactivate_request(struct request_queue *q, struct request *rq) { struct cfq_data *cfqd = q->elevator->elevator_data; + const int sync = rq_is_sync(rq); - WARN_ON(!cfqd->rq_in_driver); - cfqd->rq_in_driver--; + WARN_ON(!cfqd->rq_in_driver[sync]); + cfqd->rq_in_driver[sync]--; cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d", - cfqd->rq_in_driver); + rq_in_driver(cfqd)); } static void cfq_remove_request(struct request *rq) @@ -928,10 +1010,11 
@@ static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out) */ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd) { - if (RB_EMPTY_ROOT(&cfqd->service_tree.rb)) - return NULL; + struct cfq_rb_root *service_tree = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd); - return cfq_rb_first(&cfqd->service_tree); + if (RB_EMPTY_ROOT(&service_tree->rb)) + return NULL; + return cfq_rb_first(service_tree); } /* @@ -959,9 +1042,6 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd, return cfqd->last_position - blk_rq_pos(rq); } -#define CIC_SEEK_THR 8 * 1024 -#define CIC_SEEKY(cic) ((cic)->seek_mean > CIC_SEEK_THR) - static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq) { struct cfq_io_context *cic = cfqd->active_cic; @@ -1049,6 +1129,10 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd, if (cfq_cfqq_coop(cfqq)) return NULL; + /* we don't want to mix processes with different characteristics */ + if (cfqq->service_tree != cur_cfqq->service_tree) + return NULL; + if (!probe) cfq_mark_cfqq_coop(cfqq); return cfqq; @@ -1080,7 +1164,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd) /* * still requests with the driver, don't idle */ - if (cfqd->rq_in_driver) + if (rq_in_driver(cfqd)) return; /* @@ -1092,14 +1176,15 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd) cfq_mark_cfqq_wait_request(cfqq); - /* - * we don't want to idle for seeks, but we do want to allow - * fair distribution of slice time for a process doing back-to-back - * seeks. 
so allow a little bit of time for him to submit a new rq - */ - sl = cfqd->cfq_slice_idle; - if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic)) + sl = min_t(unsigned, cfqd->cfq_slice_idle, cfqq->slice_end - jiffies); + + /* very small idle if we are serving noidle trees, and there are more trees */ + if (cfqd->serving_type == SYNC_NOIDLE_WL && + service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count > 0) { + if (blk_queue_nonrot(cfqd->queue)) + return; sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT)); + } mod_timer(&cfqd->idle_slice_timer, jiffies + sl); cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl); @@ -1115,6 +1200,12 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq) cfq_log_cfqq(cfqd, cfqq, "dispatch_insert"); + if (!time_before(jiffies, rq->start_time + cfqd->cfq_target_latency / 2) && rq_data_dir(rq)==READ) { + cfqd->reads_delayed = max_t(int, cfqd->reads_delayed, + (jiffies - rq->start_time) / (cfqd->cfq_target_latency / 2)); + } + + cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq); cfq_remove_request(rq); cfqq->dispatched++; elv_dispatch_sort(q, rq); @@ -1160,6 +1251,16 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq) return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio)); } +enum wl_type_t cfq_choose_sync_async(struct cfq_data *cfqd, enum wl_prio_t prio) { + struct cfq_queue *id, *ni; + ni = cfq_rb_first(service_tree_for(prio, SYNC_NOIDLE_WL, cfqd)); + id = cfq_rb_first(service_tree_for(prio, SYNC_WL, cfqd)); + if (id && ni && id->rb_key < ni->rb_key) + return SYNC_WL; + if (!ni) return SYNC_WL; + return SYNC_NOIDLE_WL; +} + /* * Select a queue for service. If we have a current active queue, * check whether to continue servicing it, or retrieve and set a new one. @@ -1179,20 +1280,6 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd) goto expire; /* - * If we have a RT cfqq waiting, then we pre-empt the current non-rt - * cfqq. 
- */ - if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) { - /* - * We simulate this as cfqq timed out so that it gets to bank - * the remaining of its time slice. - */ - cfq_log_cfqq(cfqd, cfqq, "preempt"); - cfq_slice_expired(cfqd, 1); - goto new_queue; - } - - /* * The active queue has requests and isn't expired, allow it to * dispatch. */ @@ -1214,15 +1301,68 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd) * flight or is idling for a new request, allow either of these * conditions to happen (or time out) before selecting a new queue. */ - if (timer_pending(&cfqd->idle_slice_timer) || + if (timer_pending(&cfqd->idle_slice_timer) || (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) { cfqq = NULL; goto keep_queue; } - expire: cfq_slice_expired(cfqd, 0); new_queue: + if (!new_cfqq) { + enum wl_prio_t previous_prio = cfqd->serving_prio; + + if (cfq_busy_queues_wl(RT_WL, cfqd)) + cfqd->serving_prio = RT_WL; + else if (cfq_busy_queues_wl(BE_WL, cfqd)) + cfqd->serving_prio = BE_WL; + else { + cfqd->serving_prio = IDLE_WL; + cfqd->workload_expires = jiffies + 1; + cfqd->reads_delayed = 0; + } + + if (cfqd->serving_prio != IDLE_WL) { + int counts[]={ + service_tree_for(cfqd->serving_prio, ASYNC_WL, cfqd)->count, + service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count, + service_tree_for(cfqd->serving_prio, SYNC_WL, cfqd)->count + }; + int nonzero_counts= !!counts[0] + !!counts[1] + !!counts[2]; + + if (previous_prio != cfqd->serving_prio || (nonzero_counts == 1)) { + cfqd->serving_type = counts[1] ? SYNC_NOIDLE_WL : counts[2] ? 
SYNC_WL : ASYNC_WL; + cfqd->async_starved = 0; + cfqd->reads_delayed = 0; + } else { + if (!counts[cfqd->serving_type] || time_after(jiffies, cfqd->workload_expires)) { + if (cfqd->serving_type != ASYNC_WL && counts[ASYNC_WL] && + cfqd->async_starved++ > cfqd->cfq_async_penalty * (1 + cfqd->reads_delayed)) + cfqd->serving_type = ASYNC_WL; + else + cfqd->serving_type = cfq_choose_sync_async(cfqd, cfqd->serving_prio); + } else + goto same_wl; + } + + { + unsigned slice = cfqd->cfq_target_latency; + slice = slice * counts[cfqd->serving_type] / + max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio], + counts[SYNC_WL] + counts[SYNC_NOIDLE_WL] + counts[ASYNC_WL]); + + if (cfqd->serving_type == ASYNC_WL) + slice = max(1U, (slice / (1 + cfqd->reads_delayed)) + * cfqd->cfq_slice[0] / cfqd->cfq_slice[1]); + else + slice = max(slice, 2U * max(1U, cfqd->cfq_slice_idle)); + + cfqd->workload_expires = jiffies + slice; + cfqd->async_starved *= (cfqd->serving_type != ASYNC_WL); + } + } + } + same_wl: cfqq = cfq_set_active_queue(cfqd, new_cfqq); keep_queue: return cfqq; @@ -1249,8 +1389,13 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd) { struct cfq_queue *cfqq; int dispatched = 0; + int i,j; + for (i = 0; i < 2; ++i) + for (j = 0; j < 3; ++j) + while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j])) != NULL) + dispatched += __cfq_forced_dispatch_cfqq(cfqq); - while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL) + while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL) dispatched += __cfq_forced_dispatch_cfqq(cfqq); cfq_slice_expired(cfqd, 0); @@ -1312,6 +1457,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force) return 0; /* + * Drain async requests before we start sync IO + */ + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) + return 0; + + /* * If this is an async queue and we have sync IO in flight, let it wait */ if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) @@ -1362,7 +1513,7 @@ static int 
cfq_dispatch_requests(struct request_queue *q, int force) cfq_slice_expired(cfqd, 0); } - cfq_log(cfqd, "dispatched a request"); + cfq_log_cfqq(cfqd, cfqq, "dispatched a request"); return 1; } @@ -2004,18 +2155,8 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq, if (cfq_class_idle(cfqq)) return 1; - /* - * if the new request is sync, but the currently running queue is - * not, let the sync request have priority. - */ - if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq)) - return 1; - - /* - * So both queues are sync. Let the new request get disk time if - * it's a metadata request and the current queue is doing regular IO. - */ - if (rq_is_meta(rq) && !cfqq->meta_pending) + if (cfqd->serving_type == SYNC_NOIDLE_WL + && new_cfqq->service_tree == cfqq->service_tree) return 1; /* @@ -2046,13 +2187,9 @@ static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq) cfq_log_cfqq(cfqd, cfqq, "preempt"); cfq_slice_expired(cfqd, 1); - /* - * Put the new queue at the front of the of the current list, - * so we know that it will be selected next. 
- */ BUG_ON(!cfq_cfqq_on_rr(cfqq)); - cfq_service_tree_add(cfqd, cfqq, 1); + cfq_service_tree_add(cfqd, cfqq); cfqq->slice_end = 0; cfq_mark_cfqq_slice_new(cfqq); @@ -2130,11 +2267,11 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq) */ static void cfq_update_hw_tag(struct cfq_data *cfqd) { - if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak) - cfqd->rq_in_driver_peak = cfqd->rq_in_driver; + if (rq_in_driver(cfqd) > cfqd->rq_in_driver_peak) + cfqd->rq_in_driver_peak = rq_in_driver(cfqd); if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN && - cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN) + rq_in_driver(cfqd) <= CFQ_HW_QUEUE_MIN) return; if (cfqd->hw_tag_samples++ < 50) @@ -2161,9 +2298,9 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq) cfq_update_hw_tag(cfqd); - WARN_ON(!cfqd->rq_in_driver); + WARN_ON(!cfqd->rq_in_driver[sync]); WARN_ON(!cfqq->dispatched); - cfqd->rq_in_driver--; + cfqd->rq_in_driver[sync]--; cfqq->dispatched--; if (cfq_cfqq_sync(cfqq)) @@ -2197,7 +2334,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq) cfq_arm_slice_timer(cfqd); } - if (!cfqd->rq_in_driver) + if (!rq_in_driver(cfqd)) cfq_schedule_dispatch(cfqd); } @@ -2229,8 +2366,7 @@ static void cfq_prio_boost(struct cfq_queue *cfqq) static inline int __cfq_may_queue(struct cfq_queue *cfqq) { - if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) && - !cfq_cfqq_must_alloc_slice(cfqq)) { + if (cfq_cfqq_wait_request(cfqq) && !cfq_cfqq_must_alloc_slice(cfqq)) { cfq_mark_cfqq_must_alloc_slice(cfqq); return ELV_MQUEUE_MUST; } @@ -2317,7 +2453,6 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask) } cfqq->allocated[rw]++; - cfq_clear_cfqq_must_alloc(cfqq); atomic_inc(&cfqq->ref); spin_unlock_irqrestore(q->queue_lock, flags); @@ -2451,13 +2586,16 @@ static void cfq_exit_queue(struct elevator_queue *e) static void *cfq_init_queue(struct request_queue *q) { struct cfq_data *cfqd; - int i; + 
int i,j; cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node); if (!cfqd) return NULL; - cfqd->service_tree = CFQ_RB_ROOT; + for (i = 0; i < 2; ++i) + for (j = 0; j < 3; ++j) + cfqd->service_trees[i][j] = CFQ_RB_ROOT; + cfqd->service_tree_idle = CFQ_RB_ROOT; /* * Not strictly needed (since RB_ROOT just clears the node and we @@ -2494,6 +2632,9 @@ static void *cfq_init_queue(struct request_queue *q) cfqd->cfq_slice[1] = cfq_slice_sync; cfqd->cfq_slice_async_rq = cfq_slice_async_rq; cfqd->cfq_slice_idle = cfq_slice_idle; + cfqd->cfq_target_latency = cfq_target_latency; + cfqd->cfq_hist_divisor = cfq_hist_divisor; + cfqd->cfq_async_penalty = cfq_async_penalty; cfqd->hw_tag = 1; return cfqd; @@ -2530,6 +2671,7 @@ fail: /* * sysfs parts below --> */ + static ssize_t cfq_var_show(unsigned int var, char *page) { @@ -2563,6 +2705,9 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1); SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1); SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1); SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0); +SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1); +SHOW_FUNCTION(cfq_hist_divisor_show, cfqd->cfq_hist_divisor, 0); +SHOW_FUNCTION(cfq_async_penalty_show, cfqd->cfq_async_penalty, 0); #undef SHOW_FUNCTION #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \ @@ -2594,6 +2739,11 @@ STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1); STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1); STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1, UINT_MAX, 0); + +STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 1000, 1); +STORE_FUNCTION(cfq_hist_divisor_store, &cfqd->cfq_hist_divisor, 1, 100, 0); +STORE_FUNCTION(cfq_async_penalty_store, &cfqd->cfq_async_penalty, 1, UINT_MAX, 0); + #undef STORE_FUNCTION #define CFQ_ATTR(name) \ @@ -2609,6 +2759,9 @@ static struct elv_fs_entry 
cfq_attrs[] = { CFQ_ATTR(slice_async), CFQ_ATTR(slice_async_rq), CFQ_ATTR(slice_idle), + CFQ_ATTR(target_latency), + CFQ_ATTR(hist_divisor), + CFQ_ATTR(async_penalty), __ATTR_NULL }; [-- Attachment #3: Type: text/plain, Size: 206 bytes --] _______________________________________________ Containers mailing list Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply related [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-09-28 14:56 ` Vivek Goyal (?) (?) @ 2009-09-28 15:35 ` Corrado Zoccolo 2009-09-28 17:14 ` Vivek Goyal ` (2 more replies) -1 siblings, 3 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-09-28 15:35 UTC (permalink / raw)
To: Vivek Goyal
Cc: Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe, Tobias Oetiker

[-- Attachment #1: Type: text/plain, Size: 5235 bytes --]

On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
>> Hi Vivek,
>> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> >> Vivek Goyal wrote:
>> >> > Notes:
>> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >> >   Bring down its throughput and bump up latencies significantly.
>> >>
>> >>
>> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> >> too.
>> >>
>> >> I'm basing this assumption on the observations I made on both OpenSuse
>> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> >> titled: "Poor desktop responsiveness with background I/O-operations" of
>> >> 2009-09-20.
>> >> (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
>> >>
>> >>
>> >> Thus, I'm posting this to show that your work is greatly appreciated,
>> >> given the rather disappointing status quo of Linux's fairness when it
>> >> comes to disk IO time.
>> >>
>> >> I hope that your efforts lead to a change in performance of current
>> >> userland applications, the sooner, the better.
>> >>
>> > [Please don't remove people from original CC list. I am putting them back.]
>> >
>> > Hi Ulrich,
>> >
>> > I quickly went through that mail thread and I tried the following on my
>> > desktop.
>> >
>> > ##########################################
>> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> > sleep 5
>> > time firefox
>> > # close firefox once gui pops up.
>> > ##########################################
>> >
>> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
>> > the following.
>> >
>> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>> >
>> > (Results do vary across runs, especially if system is booted fresh. Don't
>> > know why...).
>> >
>> >
>> > Then I tried putting both the applications in separate groups and assigned
>> > them weights of 200 each.
>> >
>> > ##########################################
>> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> > echo $! > /cgroup/io/test1/tasks
>> > sleep 5
>> > echo $$ > /cgroup/io/test2/tasks
>> > time firefox
>> > # close firefox once gui pops up.
>> > ##########################################
>> >
>> > Now firefox pops up in 27 seconds. So it cut down the time by 2/3.
>> >
>> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>> >
>> > Notice that throughput of dd also improved.
>> >
>> > I ran the block trace and noticed that in many cases firefox threads
>> > immediately preempted the "dd". Probably because it was a file system
>> > request. So in this case latency will arise from seek time.
>> >
>> > In some other cases, threads had to wait for up to 100ms because dd was
>> > not preempted. In this case latency will arise both from waiting on queue
>> > as well as seek time.
>>
>> I think cfq should already be doing something similar, i.e. giving
>> 100ms slices to firefox, that alternate with dd, unless:
>> * firefox is too seeky (in this case, the idle window will be too small)
>> * firefox has too much think time.
>>
>
Hi Vivek,
> Hi Corrado,
>
> "firefox" is the shell script to set up the environment and launch the
> browser.
> It seems to be a group of threads. Some of them run in parallel
> and some of these seem to be running one after the other (once previous
> process or threads finished).

Ok.

>
>> To rule out the first case, what happens if you run the test with your
>> "fairness for seeky processes" patch?
>
> I applied that patch and it helps a lot.
>
> http://lwn.net/Articles/341032/
>
> With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.

Great.
Can you try the attached patch (on top of 2.6.31)?
It implements the alternative approach we discussed privately in July,
and it addresses the possible latency increase that could happen with
your patch.

To summarize for everyone, we separate sync sequential queues, sync
seeky queues and async queues into three separate RR structures, and
alternate servicing requests between them.

When servicing seeky queues (the ones that are usually penalized by
cfq, for which no fairness is usually provided), we do not idle
between them, but we do idle for the last queue (the idle can be
exited when any seeky queue has requests). This allows us to allocate
disk time globally for all seeky processes, and to reduce the latency
of seeky processes.

I tested with 'konsole -e exit', while doing a sequential write with
dd, and the startup time dropped from 37s to 7s, on an old laptop
disk.

Thanks,
Corrado

>
>> To rule out the first case, what happens if you run the test with your
>> "fairness for seeky processes" patch?
>
> I applied that patch and it helps a lot.
>
> http://lwn.net/Articles/341032/
>
> With above patchset applied, and fairness=1, firefox pops up in 27-28
> seconds.
>
> So it looks like if we don't disable idle window for seeky processes on
> hardware supporting command queuing, it helps in this particular case.
>
> Thanks
> Vivek
>

[-- Attachment #2: cfq.patch --]
[-- Type: application/octet-stream, Size: 24221 bytes --]

diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index fd7080e..064f4fb 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -27,6 +27,12 @@ static const int cfq_slice_sync = HZ / 10;
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
+static int cfq_target_latency = HZ * 3/10; /* 300 ms */
+static int cfq_hist_divisor = 4;
+/*
+ * Number of times that other workloads can be scheduled before async
+ */
+static const unsigned int cfq_async_penalty = 4;
 
 /*
  * offset from end of service tree
@@ -36,7 +42,7 @@ static int cfq_slice_idle = HZ / 125;
 /*
  * below this threshold, we consider thinktime immediate
  */
-#define CFQ_MIN_TT		(2)
+#define CFQ_MIN_TT		(1)
 
 #define CFQ_SLICE_SCALE		(5)
 #define CFQ_HW_QUEUE_MIN	(5)
@@ -67,8 +73,9 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
 struct cfq_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
+	unsigned count;
 };
-#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, }
+#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, 0, }
 
 /*
  * Per process-grouping structure
@@ -113,6 +120,21 @@ struct cfq_queue {
 	unsigned short ioprio_class, org_ioprio_class;
 
 	pid_t pid;
+
+	struct cfq_rb_root *service_tree;
+	struct cfq_io_context *cic;
+};
+
+enum wl_prio_t {
+	IDLE_WL = -1,
+	BE_WL = 0,
+	RT_WL = 1
+};
+
+enum wl_type_t {
+	ASYNC_WL = 0,
+	SYNC_NOIDLE_WL = 1,
+	SYNC_WL = 2
 };
 
 /*
@@ -124,7 +146,13 @@ struct cfq_data {
 	/*
 	 * rr list of queues with requests and the count of them
 	 */
-	struct cfq_rb_root service_tree;
+	struct cfq_rb_root service_trees[2][3];
+	struct cfq_rb_root service_tree_idle;
+
+	enum wl_prio_t serving_prio;
+	enum wl_type_t serving_type;
+	unsigned long workload_expires;
+	unsigned int async_starved;
 
 	/*
 	 * Each priority tree is sorted by next_request position. These
@@ -134,14 +162,11 @@ struct cfq_data {
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];
 
 	unsigned int busy_queues;
-	/*
-	 * Used to track any pending rt requests so we can pre-empt current
-	 * non-RT cfqq in service when this value is non-zero.
-	 */
-	unsigned int busy_rt_queues;
+	unsigned int busy_queues_avg[2];
 
-	int rq_in_driver;
+	int rq_in_driver[2];
 	int sync_flight;
+	int reads_delayed;
 
 	/*
 	 * queue-depth detection
@@ -178,6 +203,9 @@ struct cfq_data {
 	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
+	unsigned int cfq_target_latency;
+	unsigned int cfq_hist_divisor;
+	unsigned int cfq_async_penalty;
 
 	struct list_head cic_list;
 
@@ -187,11 +215,15 @@ struct cfq_data {
 	struct cfq_queue oom_cfqq;
 };
 
+static struct cfq_rb_root * service_tree_for(enum wl_prio_t prio, enum wl_type_t type,
+		struct cfq_data *cfqd) {
+	return prio == IDLE_WL ? &cfqd->service_tree_idle : &cfqd->service_trees[prio][type];
+}
+
 enum cfqq_state_flags {
 	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
 	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
 	CFQ_CFQQ_FLAG_must_dispatch,	/* must be allowed a dispatch */
-	CFQ_CFQQ_FLAG_must_alloc,	/* must be allowed rq alloc */
 	CFQ_CFQQ_FLAG_must_alloc_slice,	/* per-slice must_alloc flag */
 	CFQ_CFQQ_FLAG_fifo_expire,	/* FIFO checked in this slice */
 	CFQ_CFQQ_FLAG_idle_window,	/* slice idling enabled */
@@ -218,7 +250,6 @@ static inline int cfq_cfqq_##name(const struct cfq_queue *cfqq)	\
 CFQ_CFQQ_FNS(on_rr);
 CFQ_CFQQ_FNS(wait_request);
 CFQ_CFQQ_FNS(must_dispatch);
-CFQ_CFQQ_FNS(must_alloc);
 CFQ_CFQQ_FNS(must_alloc_slice);
 CFQ_CFQQ_FNS(fifo_expire);
 CFQ_CFQQ_FNS(idle_window);
@@ -233,12 +264,28 @@ CFQ_CFQQ_FNS(coop);
 #define cfq_log(cfqd, fmt, args...)	\
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)
 
+#define CIC_SEEK_THR	1024
+#define CIC_SEEKY(cic)	((cic)->seek_mean > CIC_SEEK_THR)
+#define CFQQ_SEEKY(cfqq)	(!cfqq->cic || CIC_SEEKY(cfqq->cic))
+
+static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd) {
+	return wl==IDLE_WL? cfqd->service_tree_idle.count :
+		cfqd->service_trees[wl][ASYNC_WL].count
+		+ cfqd->service_trees[wl][SYNC_NOIDLE_WL].count
+		+ cfqd->service_trees[wl][SYNC_WL].count;
+}
+
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
 				       struct io_context *, gfp_t);
 static struct cfq_io_context *cfq_cic_lookup(struct cfq_data *,
 						struct io_context *);
 
+static inline int rq_in_driver(struct cfq_data *cfqd)
+{
+	return cfqd->rq_in_driver[0] + cfqd->rq_in_driver[1];
+}
+
 static inline struct cfq_queue *cic_to_cfqq(struct cfq_io_context *cic,
 					    int is_sync)
 {
@@ -249,6 +296,7 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
 				struct cfq_queue *cfqq, int is_sync)
 {
 	cic->cfqq[!!is_sync] = cfqq;
+	cfqq->cic = cic;
 }
 
 /*
@@ -257,7 +305,7 @@ static inline void cic_set_cfqq(struct cfq_io_context *cic,
  */
 static inline int cfq_bio_sync(struct bio *bio)
 {
-	if (bio_data_dir(bio) == READ || bio_sync(bio))
+	if (bio_data_dir(bio) == READ || bio_rw_flagged(bio, BIO_RW_SYNCIO))
 		return 1;
 
 	return 0;
@@ -303,10 +351,33 @@ cfq_prio_to_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
 }
 
+static inline unsigned
+cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) {
+	unsigned min_q, max_q;
+	unsigned mult = cfqd->cfq_hist_divisor - 1;
+	unsigned round = cfqd->cfq_hist_divisor / 2;
+	unsigned busy = cfq_busy_queues_wl(rt, cfqd);
+	min_q = min(cfqd->busy_queues_avg[rt], busy);
+	max_q = max(cfqd->busy_queues_avg[rt], busy);
+	cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
+		cfqd->cfq_hist_divisor;
+	return cfqd->busy_queues_avg[rt];
+}
+
 static inline void
 cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+	unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1];
+	unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq));
+	unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
+
+	if (iq > process_thr) {
+		unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle
+			/ cfqd->cfq_slice[1];
+		slice = max(slice * process_thr / iq, min(slice, low_slice));
+	}
+
+	cfqq->slice_end = jiffies + slice;
 	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
 }
 
@@ -445,6 +516,7 @@ static void cfq_rb_erase(struct rb_node *n, struct cfq_rb_root *root)
 	if (root->left == n)
 		root->left = NULL;
 	rb_erase_init(n, &root->rb);
+	--root->count;
 }
 
 /*
@@ -485,46 +557,56 @@ static unsigned long cfq_slice_offset(struct cfq_data *cfqd,
 }
 
 /*
- * The cfqd->service_tree holds all pending cfq_queue's that have
+ * The cfqd->service_trees holds all pending cfq_queue's that have
  * requests waiting to be processed. It is sorted in the order that
  * we will service the queues.
  */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
+static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	struct rb_node **p, *parent;
 	struct cfq_queue *__cfqq;
 	unsigned long rb_key;
+	struct cfq_rb_root *service_tree;
 	int left;
 
 	if (cfq_class_idle(cfqq)) {
 		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
+		service_tree = &cfqd->service_tree_idle;
+		parent = rb_last(&service_tree->rb);
 		if (parent && parent != &cfqq->rb_node) {
 			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 			rb_key += __cfqq->rb_key;
 		} else
 			rb_key += jiffies;
-	} else if (!add_front) {
+	} else {
+		enum wl_prio_t prio = cfq_class_rt(cfqq) ? RT_WL : BE_WL;
+		enum wl_type_t type = cfq_cfqq_sync(cfqq) ? SYNC_WL : ASYNC_WL;
+
 		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
 		rb_key += cfqq->slice_resid;
 		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
+
+		if (type == SYNC_WL && (CFQQ_SEEKY(cfqq) || !cfq_cfqq_idle_window(cfqq)))
+			type = SYNC_NOIDLE_WL;
+
+		service_tree = service_tree_for(prio, type, cfqd);
+	}
 
 	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
 		/*
 		 * same position, nothing more to do
 		 */
-		if (rb_key == cfqq->rb_key)
+		if (rb_key == cfqq->rb_key && cfqq->service_tree == service_tree)
 			return;
 
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+		cfqq->service_tree = NULL;
 	}
 
 	left = 1;
 	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
+	cfqq->service_tree = service_tree;
+	p = &service_tree->rb.rb_node;
 	while (*p) {
 		struct rb_node **n;
 
@@ -556,11 +638,12 @@ static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
 	}
 
 	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
+		service_tree->left = &cfqq->rb_node;
 
 	cfqq->rb_key = rb_key;
 	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
+	rb_insert_color(&cfqq->rb_node, &service_tree->rb);
+	service_tree->count++;
 }
 
 static struct cfq_queue *
@@ -633,7 +716,7 @@ static void cfq_resort_rr_list(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	 * Resorting requires the cfqq to be on the RR list already.
 	 */
 	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
+		cfq_service_tree_add(cfqd, cfqq);
 		cfq_prio_tree_add(cfqd, cfqq);
 	}
 }
@@ -648,8 +731,6 @@ static void cfq_add_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	BUG_ON(cfq_cfqq_on_rr(cfqq));
 	cfq_mark_cfqq_on_rr(cfqq);
 	cfqd->busy_queues++;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues++;
 
 	cfq_resort_rr_list(cfqd, cfqq);
 }
@@ -664,8 +745,10 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));
 	cfq_clear_cfqq_on_rr(cfqq);
 
-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+		cfqq->service_tree = NULL;
+	}
 	if (cfqq->p_root) {
 		rb_erase(&cfqq->p_node, cfqq->p_root);
 		cfqq->p_root = NULL;
@@ -673,8 +756,6 @@ static void cfq_del_cfqq_rr(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 
 	BUG_ON(!cfqd->busy_queues);
 	cfqd->busy_queues--;
-	if (cfq_class_rt(cfqq))
-		cfqd->busy_rt_queues--;
 }
 
 /*
@@ -760,9 +841,9 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
 
-	cfqd->rq_in_driver++;
+	cfqd->rq_in_driver[rq_is_sync(rq)]++;
 	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "activate rq, drv=%d",
-						cfqd->rq_in_driver);
+						rq_in_driver(cfqd));
 
 	cfqd->last_position = blk_rq_pos(rq) + blk_rq_sectors(rq);
 }
@@ -770,11 +851,12 @@ static void cfq_activate_request(struct request_queue *q, struct request *rq)
 static void cfq_deactivate_request(struct request_queue *q, struct request *rq)
 {
 	struct cfq_data *cfqd = q->elevator->elevator_data;
+	const int sync = rq_is_sync(rq);
 
-	WARN_ON(!cfqd->rq_in_driver);
-	cfqd->rq_in_driver--;
+	WARN_ON(!cfqd->rq_in_driver[sync]);
+	cfqd->rq_in_driver[sync]--;
 	cfq_log_cfqq(cfqd, RQ_CFQQ(rq), "deactivate rq, drv=%d",
-						cfqd->rq_in_driver);
+						rq_in_driver(cfqd));
 }
 
 static void cfq_remove_request(struct request *rq)
@@ -928,10 +1010,11 @@ static inline void cfq_slice_expired(struct cfq_data *cfqd, int timed_out)
  */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
+	struct cfq_rb_root *service_tree = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd);
 
-	return cfq_rb_first(&cfqd->service_tree);
+	if (RB_EMPTY_ROOT(&service_tree->rb))
+		return NULL;
+	return cfq_rb_first(service_tree);
 }
 
 /*
@@ -959,9 +1042,6 @@ static inline sector_t cfq_dist_from_last(struct cfq_data *cfqd,
 		return cfqd->last_position - blk_rq_pos(rq);
 }
 
-#define CIC_SEEK_THR	8 * 1024
-#define CIC_SEEKY(cic)	((cic)->seek_mean > CIC_SEEK_THR)
-
 static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq)
 {
 	struct cfq_io_context *cic = cfqd->active_cic;
@@ -1049,6 +1129,10 @@ static struct cfq_queue *cfq_close_cooperator(struct cfq_data *cfqd,
 	if (cfq_cfqq_coop(cfqq))
 		return NULL;
 
+	/* we don't want to mix processes with different characteristics */
+	if (cfqq->service_tree != cur_cfqq->service_tree)
+		return NULL;
+
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
 	return cfqq;
@@ -1080,7 +1164,7 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 	/*
 	 * still requests with the driver, don't idle
 	 */
-	if (cfqd->rq_in_driver)
+	if (rq_in_driver(cfqd))
 		return;
 
 	/*
@@ -1092,14 +1176,15 @@ static void cfq_arm_slice_timer(struct cfq_data *cfqd)
 
 	cfq_mark_cfqq_wait_request(cfqq);
 
-	/*
-	 * we don't want to idle for seeks, but we do want to allow
-	 * fair distribution of slice time for a process doing back-to-back
-	 * seeks. so allow a little bit of time for him to submit a new rq
-	 */
-	sl = cfqd->cfq_slice_idle;
-	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
+	sl = min_t(unsigned, cfqd->cfq_slice_idle, cfqq->slice_end - jiffies);
+
+	/* very small idle if we are serving noidle trees, and there are more trees */
+	if (cfqd->serving_type == SYNC_NOIDLE_WL &&
+	    service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count > 0) {
+		if (blk_queue_nonrot(cfqd->queue))
+			return;
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
+	}
 
 	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
@@ -1115,6 +1200,12 @@ static void cfq_dispatch_insert(struct request_queue *q, struct request *rq)
 
 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");
 
+	if (!time_before(jiffies, rq->start_time + cfqd->cfq_target_latency / 2) && rq_data_dir(rq)==READ) {
+		cfqd->reads_delayed = max_t(int, cfqd->reads_delayed,
+					    (jiffies - rq->start_time) / (cfqd->cfq_target_latency / 2));
+	}
+
+	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
 	cfqq->dispatched++;
 	elv_dispatch_sort(q, rq);
@@ -1160,6 +1251,16 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
 }
 
+enum wl_type_t cfq_choose_sync_async(struct cfq_data *cfqd, enum wl_prio_t prio) {
+	struct cfq_queue *id, *ni;
+	ni = cfq_rb_first(service_tree_for(prio, SYNC_NOIDLE_WL, cfqd));
+	id = cfq_rb_first(service_tree_for(prio, SYNC_WL, cfqd));
+	if (id && ni && id->rb_key < ni->rb_key)
+		return SYNC_WL;
+	if (!ni) return SYNC_WL;
+	return SYNC_NOIDLE_WL;
+}
+
 /*
  * Select a queue for service. If we have a current active queue,
  * check whether to continue servicing it, or retrieve and set a new one.
@@ -1179,20 +1280,6 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 		goto expire;
 
 	/*
-	 * If we have a RT cfqq waiting, then we pre-empt the current non-rt
-	 * cfqq.
-	 */
-	if (!cfq_class_rt(cfqq) && cfqd->busy_rt_queues) {
-		/*
-		 * We simulate this as cfqq timed out so that it gets to bank
-		 * the remaining of its time slice.
-		 */
-		cfq_log_cfqq(cfqd, cfqq, "preempt");
-		cfq_slice_expired(cfqd, 1);
-		goto new_queue;
-	}
-
-	/*
 	 * The active queue has requests and isn't expired, allow it to
 	 * dispatch.
 	 */
@@ -1214,15 +1301,68 @@ static struct cfq_queue *cfq_select_queue(struct cfq_data *cfqd)
 	 * flight or is idling for a new request, allow either of these
 	 * conditions to happen (or time out) before selecting a new queue.
 	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
+	if (timer_pending(&cfqd->idle_slice_timer) ||
 	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
 		cfqq = NULL;
 		goto keep_queue;
 	}
-
 expire:
 	cfq_slice_expired(cfqd, 0);
 new_queue:
+	if (!new_cfqq) {
+		enum wl_prio_t previous_prio = cfqd->serving_prio;
+
+		if (cfq_busy_queues_wl(RT_WL, cfqd))
+			cfqd->serving_prio = RT_WL;
+		else if (cfq_busy_queues_wl(BE_WL, cfqd))
+			cfqd->serving_prio = BE_WL;
+		else {
+			cfqd->serving_prio = IDLE_WL;
+			cfqd->workload_expires = jiffies + 1;
+			cfqd->reads_delayed = 0;
+		}
+
+		if (cfqd->serving_prio != IDLE_WL) {
+			int counts[]={
+				service_tree_for(cfqd->serving_prio, ASYNC_WL, cfqd)->count,
+				service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count,
+				service_tree_for(cfqd->serving_prio, SYNC_WL, cfqd)->count
+			};
+			int nonzero_counts= !!counts[0] + !!counts[1] + !!counts[2];
+
+			if (previous_prio != cfqd->serving_prio || (nonzero_counts == 1)) {
+				cfqd->serving_type = counts[1] ? SYNC_NOIDLE_WL : counts[2] ? SYNC_WL : ASYNC_WL;
+				cfqd->async_starved = 0;
+				cfqd->reads_delayed = 0;
+			} else {
+				if (!counts[cfqd->serving_type] || time_after(jiffies, cfqd->workload_expires)) {
+					if (cfqd->serving_type != ASYNC_WL && counts[ASYNC_WL] &&
+					    cfqd->async_starved++ > cfqd->cfq_async_penalty * (1 + cfqd->reads_delayed))
+						cfqd->serving_type = ASYNC_WL;
+					else
+						cfqd->serving_type = cfq_choose_sync_async(cfqd, cfqd->serving_prio);
+				} else
+					goto same_wl;
+			}
+
+			{
+				unsigned slice = cfqd->cfq_target_latency;
+				slice = slice * counts[cfqd->serving_type] /
+					max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio],
+					      counts[SYNC_WL] + counts[SYNC_NOIDLE_WL] + counts[ASYNC_WL]);
+
+				if (cfqd->serving_type == ASYNC_WL)
+					slice = max(1U, (slice / (1 + cfqd->reads_delayed))
+						    * cfqd->cfq_slice[0] / cfqd->cfq_slice[1]);
+				else
+					slice = max(slice, 2U * max(1U, cfqd->cfq_slice_idle));
+
+				cfqd->workload_expires = jiffies + slice;
+				cfqd->async_starved *= (cfqd->serving_type != ASYNC_WL);
+			}
+		}
+	}
+ same_wl:
 	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
 keep_queue:
 	return cfqq;
@@ -1249,8 +1389,13 @@ static int cfq_forced_dispatch(struct cfq_data *cfqd)
 {
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
+	int i,j;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j])) != NULL)
+				dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);
 
 	cfq_slice_expired(cfqd, 0);
@@ -1312,6 +1457,12 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		return 0;
 
 	/*
+	 * Drain async requests before we start sync IO
+	 */
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+		return 0;
+
+	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
@@ -1362,7 +1513,7 @@ static int cfq_dispatch_requests(struct request_queue *q, int force)
 		cfq_slice_expired(cfqd, 0);
 	}
 
-	cfq_log(cfqd, "dispatched a request");
+	cfq_log_cfqq(cfqd, cfqq, "dispatched a request");
 	return 1;
 }
 
@@ -2004,18 +2155,8 @@ cfq_should_preempt(struct cfq_data *cfqd, struct cfq_queue *new_cfqq,
 	if (cfq_class_idle(cfqq))
 		return 1;
 
-	/*
-	 * if the new request is sync, but the currently running queue is
-	 * not, let the sync request have priority.
-	 */
-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
-		return 1;
-
-	/*
-	 * So both queues are sync. Let the new request get disk time if
-	 * it's a metadata request and the current queue is doing regular IO.
-	 */
-	if (rq_is_meta(rq) && !cfqq->meta_pending)
+	if (cfqd->serving_type == SYNC_NOIDLE_WL
+	    && new_cfqq->service_tree == cfqq->service_tree)
 		return 1;
 
 	/*
@@ -2046,13 +2187,9 @@ static void cfq_preempt_queue(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
 	cfq_slice_expired(cfqd, 1);
 
-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));
 
-	cfq_service_tree_add(cfqd, cfqq, 1);
+	cfq_service_tree_add(cfqd, cfqq);
 
 	cfqq->slice_end = 0;
 	cfq_mark_cfqq_slice_new(cfqq);
@@ -2130,11 +2267,11 @@ static void cfq_insert_request(struct request_queue *q, struct request *rq)
  */
 static void cfq_update_hw_tag(struct cfq_data *cfqd)
 {
-	if (cfqd->rq_in_driver > cfqd->rq_in_driver_peak)
-		cfqd->rq_in_driver_peak = cfqd->rq_in_driver;
+	if (rq_in_driver(cfqd) > cfqd->rq_in_driver_peak)
+		cfqd->rq_in_driver_peak = rq_in_driver(cfqd);
 
 	if (cfqd->rq_queued <= CFQ_HW_QUEUE_MIN &&
-	    cfqd->rq_in_driver <= CFQ_HW_QUEUE_MIN)
+	    rq_in_driver(cfqd) <= CFQ_HW_QUEUE_MIN)
 		return;
 
 	if (cfqd->hw_tag_samples++ < 50)
@@ -2161,9 +2298,9 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 
 	cfq_update_hw_tag(cfqd);
 
-	WARN_ON(!cfqd->rq_in_driver);
+	WARN_ON(!cfqd->rq_in_driver[sync]);
 	WARN_ON(!cfqq->dispatched);
-	cfqd->rq_in_driver--;
+	cfqd->rq_in_driver[sync]--;
 	cfqq->dispatched--;
 
 	if (cfq_cfqq_sync(cfqq))
@@ -2197,7 +2334,7 @@ static void cfq_completed_request(struct request_queue *q, struct request *rq)
 			cfq_arm_slice_timer(cfqd);
 	}
 
-	if (!cfqd->rq_in_driver)
+	if (!rq_in_driver(cfqd))
 		cfq_schedule_dispatch(cfqd);
 }
 
@@ -2229,8 +2366,7 @@ static void cfq_prio_boost(struct cfq_queue *cfqq)
 
 static inline int __cfq_may_queue(struct cfq_queue *cfqq)
 {
-	if ((cfq_cfqq_wait_request(cfqq) || cfq_cfqq_must_alloc(cfqq)) &&
-	    !cfq_cfqq_must_alloc_slice(cfqq)) {
+	if (cfq_cfqq_wait_request(cfqq) && !cfq_cfqq_must_alloc_slice(cfqq)) {
 		cfq_mark_cfqq_must_alloc_slice(cfqq);
 		return ELV_MQUEUE_MUST;
 	}
@@ -2317,7 +2453,6 @@ cfq_set_request(struct request_queue *q, struct request *rq, gfp_t gfp_mask)
 	}
 
 	cfqq->allocated[rw]++;
-	cfq_clear_cfqq_must_alloc(cfqq);
 	atomic_inc(&cfqq->ref);
 
 	spin_unlock_irqrestore(q->queue_lock, flags);
@@ -2451,13 +2586,16 @@ static void cfq_exit_queue(struct elevator_queue *e)
 static void *cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
-	int i;
+	int i,j;
 
 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
 		return NULL;
 
-	cfqd->service_tree = CFQ_RB_ROOT;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			cfqd->service_trees[i][j] = CFQ_RB_ROOT;
+	cfqd->service_tree_idle = CFQ_RB_ROOT;
 
 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
@@ -2494,6 +2632,9 @@ static void *cfq_init_queue(struct request_queue *q)
 	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
+	cfqd->cfq_target_latency = cfq_target_latency;
+	cfqd->cfq_hist_divisor = cfq_hist_divisor;
+	cfqd->cfq_async_penalty = cfq_async_penalty;
 	cfqd->hw_tag = 1;
 
 	return cfqd;
@@ -2530,6 +2671,7 @@ fail:
 /*
  * sysfs parts below -->
  */
+
 static ssize_t
 cfq_var_show(unsigned int var, char *page)
 {
@@ -2563,6 +2705,9 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd->cfq_slice_idle, 1);
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
+SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
+SHOW_FUNCTION(cfq_hist_divisor_show, cfqd->cfq_hist_divisor, 0);
+SHOW_FUNCTION(cfq_async_penalty_show, cfqd->cfq_async_penalty, 0);
 #undef SHOW_FUNCTION
 
 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2594,6 +2739,11 @@ STORE_FUNCTION(cfq_slice_sync_store, &cfqd->cfq_slice[1], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
+
+STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 1000, 1);
+STORE_FUNCTION(cfq_hist_divisor_store, &cfqd->cfq_hist_divisor, 1, 100, 0);
+STORE_FUNCTION(cfq_async_penalty_store, &cfqd->cfq_async_penalty, 1, UINT_MAX, 0);
+
 #undef STORE_FUNCTION
 
 #define CFQ_ATTR(name) \
@@ -2609,6 +2759,9 @@ static struct elv_fs_entry cfq_attrs[] = {
 	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	CFQ_ATTR(target_latency),
+	CFQ_ATTR(hist_divisor),
+	CFQ_ATTR(async_penalty),
 	__ATTR_NULL
 };
 

^ permalink raw reply related	[flat|nested] 349+ messages in thread
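The per-queue slice arithmetic that the patch adds in cfq_get_interested_queues() and cfq_set_prio_slice() can be tried outside the kernel. The following is only a user-space sketch of that arithmetic, not the kernel code itself: the struct tunables type and the helper names update_busy_avg() and scaled_slice() are hypothetical, and all values are expressed in milliseconds under the assumption that one jiffy equals 1 ms (HZ=1000), so the patch defaults become 300, 100 and 8.

```c
#include <assert.h>

/* Hypothetical stand-in for the cfqd tunables, in ms (assumes HZ=1000). */
struct tunables {
	unsigned target_latency;	/* cfq_target_latency, default 300 */
	unsigned slice_sync;		/* cfq_slice[1], default 100 */
	unsigned slice_idle;		/* cfq_slice_idle, default 8 */
};

static unsigned umin(unsigned a, unsigned b) { return a < b ? a : b; }
static unsigned umax(unsigned a, unsigned b) { return a > b ? a : b; }

/*
 * Same formula as cfq_get_interested_queues(): a weighted average that
 * favors the larger of the old average and the current busy count, so
 * it rises quickly and decays slowly.
 */
static unsigned update_busy_avg(unsigned avg, unsigned busy, unsigned divisor)
{
	unsigned mult = divisor - 1;
	unsigned round = divisor / 2;
	unsigned min_q = umin(avg, busy);
	unsigned max_q = umax(avg, busy);

	return (mult * max_q + min_q + round) / divisor;
}

/*
 * Same formula as the patched cfq_set_prio_slice(): once more queues are
 * interested than fit in the target latency, shrink each slice, but not
 * below a floor derived from the idle window.
 */
static unsigned scaled_slice(const struct tunables *t, unsigned slice,
			     unsigned iq)
{
	unsigned process_thr = t->target_latency / t->slice_sync;

	if (iq > process_thr) {
		unsigned low_slice = 2 * slice * t->slice_idle / t->slice_sync;

		slice = umax(slice * process_thr / iq, umin(slice, low_slice));
	}
	return slice;
}
```

With these assumed defaults, up to three queues keep the full 100 ms slice, four queues get 75 ms each, and the slice bottoms out at 16 ms once many queues compete.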
* Re: IO scheduler based IO controller V10
  2009-09-28 15:35 ` Corrado Zoccolo
@ 2009-09-28 17:14 ` Vivek Goyal
  2009-09-28 17:51 ` Mike Galbraith
  [not found] ` <4e5e476b0909280835w3410d58aod93a29d1dcda8909-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-28 17:14 UTC (permalink / raw)
To: Corrado Zoccolo
Cc: Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, mingo, riel, jens.axboe, Tobias Oetiker

On Mon, Sep 28, 2009 at 05:35:02PM +0200, Corrado Zoccolo wrote:
> On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
> >> Hi Vivek,
> >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> >> >> Vivek Goyal wrote:
> >> >> > Notes:
> >> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
> >> >> >   Bring down its throughput and bump up latencies significantly.
> >> >>
> >> >>
> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> >> >> too.
> >> >>
> >> >> I'm basing this assumption on the observations I made on both OpenSuse
> >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> >> >> titled: "Poor desktop responsiveness with background I/O-operations" of
> >> >> 2009-09-20.
> >> >> (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
> >> >>
> >> >>
> >> >> Thus, I'm posting this to show that your work is greatly appreciated,
> >> >> given the rather disappointing status quo of Linux's fairness when it
> >> >> comes to disk IO time.
> >> >>
> >> >> I hope that your efforts lead to a change in performance of current
> >> >> userland applications, the sooner, the better.
> >> >>
> >> > [Please don't remove people from original CC list. I am putting them back.]
> >> >
> >> > Hi Ulrich,
> >> >
> >> > I quickly went through that mail thread and I tried following on my
> >> > desktop.
> >> >
> >> > ##########################################
> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
> >> > sleep 5
> >> > time firefox
> >> > # close firefox once gui pops up.
> >> > ##########################################
> >> >
> >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
> >> > following.
> >> >
> >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
> >> >
> >> > (Results do vary across runs, especially if system is booted fresh. Don't
> >> > know why...).
> >> >
> >> >
> >> > Then I tried putting both the applications in separate groups and assign
> >> > them weights 200 each.
> >> >
> >> > ##########################################
> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
> >> > echo $! > /cgroup/io/test1/tasks
> >> > sleep 5
> >> > echo $$ > /cgroup/io/test2/tasks
> >> > time firefox
> >> > # close firefox once gui pops up.
> >> > ##########################################
> >> >
> >> > Now firefox pops up in 27 seconds. So it cut down the time by 2/3.
> >> >
> >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
> >> >
> >> > Notice that throughput of dd also improved.
> >> >
> >> > I ran the block trace and noticed in many a cases firefox threads
> >> > immediately preempted the "dd". Probably because it was a file system
> >> > request. So in this case latency will arise from seek time.
> >> >
> >> > In some other cases, threads had to wait for up to 100ms because dd was
> >> > not preempted. In this case latency will arise both from waiting on queue
> >> > as well as seek time.
> >>
> >> I think cfq should already be doing something similar, i.e. giving
> >> 100ms slices to firefox, that alternate with dd, unless:
> >> * firefox is too seeky (in this case, the idle window will be too small)
> >> * firefox has too much think time.
> >>
> >
> Hi Vivek,
> > Hi Corrado,
> >
> > "firefox" is the shell script to setup the environment and launch the
> > browser. It seems to be a group of threads. Some of them run in parallel
> > and some of these seems to be running one after the other (once previous
> > process or threads finished).
>
> Ok.
>
> >
> >> To rule out the first case, what happens if you run the test with your
> >> "fairness for seeky processes" patch?
> >
> > I applied that patch and it helps a lot.
> >
> > http://lwn.net/Articles/341032/
> >
> > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.
>
> Great.
> Can you try the attached patch (on top of 2.6.31)?
> It implements the alternative approach we discussed privately in July,
> and it addresses the possible latency increase that could happen with
> your patch.
>
> To summarize for everyone, we separate sync sequential queues, sync
> seeky queues and async queues in three separate RR structures, and
> alternate servicing requests between them.
>
> When servicing seeky queues (the ones that are usually penalized by
> cfq, for which no fairness is usually provided), we do not idle
> between them, but we do idle for the last queue (the idle can be
> exited when any seeky queue has requests). This allows us to allocate
> disk time globally for all seeky processes, and to reduce seeky
> processes latencies.
>

Ok, I seem to be doing the same thing at group level (in the group
scheduling patches). I do not idle on individual sync seeky queues, but if
this is the last queue in the group, then I do idle to make sure the group
does not lose its fair share, and exit from idle the moment there is any
busy queue in the group.

So you seem to be grouping all the sync seeky queues system wide in a
single group. So all the sync seeky queues collectively get 100ms in a
single round of dispatch? I am wondering what happens if there are a lot
of such sync seeky queues and this 100ms time slice is consumed before all
the sync seeky queues got a chance to dispatch. Does that mean that some
of the queues can completely skip one dispatch round?

Thanks
Vivek

> I tested with 'konsole -e exit', while doing a sequential write with
> dd, and the start up time reduced from 37s to 7s, on an old laptop
> disk.
>
> Thanks,
> Corrado
>
> >
> >> To rule out the first case, what happens if you run the test with your
> >> "fairness for seeky processes" patch?
> >
> > I applied that patch and it helps a lot.
> >
> > http://lwn.net/Articles/341032/
> >
> > With above patchset applied, and fairness=1, firefox pops up in 27-28
> > seconds.
> >
> > So it looks like if we don't disable idle window for seeky processes on
> > hardware supporting command queuing, it helps in this particular case.
> >
> > Thanks
> > Vivek
> >

^ permalink raw reply	[flat|nested] 349+ messages in thread
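On the question of how much time the seeky queues get collectively: going only by the arithmetic in the cfq.patch attachment posted earlier in the thread, the time given to a workload type is not a fixed 100ms but a share of cfq_target_latency proportional to the number of queues of that type. The sketch below reproduces that computation in user space as an illustration. The function name workload_slice() is hypothetical, the constants are the patch defaults converted to milliseconds under the assumption HZ=1000, and the caller is assumed to pass at least one busy queue, as the kernel code only reaches this point when the serving priority has busy queues.

```c
#include <assert.h>

enum wl_type { ASYNC_WL = 0, SYNC_NOIDLE_WL = 1, SYNC_WL = 2 };

static unsigned umax(unsigned a, unsigned b) { return a > b ? a : b; }

/*
 * Sketch of the workload slice computed in the patched cfq_select_queue().
 * counts[] holds the number of queues per workload type; busy_avg plays
 * the role of busy_queues_avg[serving_prio]. Values are in ms.
 * Precondition (assumed): busy_avg or the sum of counts[] is nonzero.
 */
static unsigned workload_slice(enum wl_type type, const unsigned counts[3],
			       unsigned busy_avg, unsigned reads_delayed)
{
	const unsigned target_latency = 300;	/* cfq_target_latency */
	const unsigned slice_sync = 100;	/* cfq_slice[1] */
	const unsigned slice_async = 40;	/* cfq_slice[0] */
	const unsigned slice_idle = 8;		/* cfq_slice_idle */
	unsigned total = counts[ASYNC_WL] + counts[SYNC_NOIDLE_WL] +
			 counts[SYNC_WL];
	unsigned slice = target_latency * counts[type] / umax(busy_avg, total);

	if (type == ASYNC_WL)
		/* async is scaled down further, more so after delayed reads */
		slice = umax(1, slice / (1 + reads_delayed)
				* slice_async / slice_sync);
	else
		/* sync workloads get at least two idle windows */
		slice = umax(slice, 2 * umax(1, slice_idle));

	return slice;
}
```

Under these assumptions, with one async queue, four seeky queues and one sequential queue, the seeky workload as a whole would be granted 200 ms per round, the sequential one 50 ms, and the async one 20 ms.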
* Re: IO scheduler based IO controller V10
  2009-09-28 17:14 ` Vivek Goyal
  (?)
@ 2009-09-29  7:10 ` Corrado Zoccolo
  -1 siblings, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-09-29  7:10 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ulrich Lukas, linux-kernel, containers, dm-devel, nauman, dpshah,
	lizf, mikew, fchecconi, paolo.valente, ryov, fernando, jmoyer,
	dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz,
	jmarchan, torvalds, mingo, riel, jens.axboe, Tobias Oetiker

Hi Vivek,
On Mon, Sep 28, 2009 at 7:14 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Sep 28, 2009 at 05:35:02PM +0200, Corrado Zoccolo wrote:
>> On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote:
>> >> Hi Vivek,
>> >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>> >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
>> >> >> Vivek Goyal wrote:
>> >> >> > Notes:
>> >> >> > - With vanilla CFQ, random writers can overwhelm a random reader.
>> >> >> >   Bring down its throughput and bump up latencies significantly.
>> >> >>
>> >> >>
>> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
>> >> >> too.
>> >> >>
>> >> >> I'm basing this assumption on the observations I made on both OpenSuse
>> >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
>> >> >> titled: "Poor desktop responsiveness with background I/O-operations" of
>> >> >> 2009-09-20.
>> >> >> (Message ID: 4AB59CBB.8090907@datenparkplatz.de)
>> >> >>
>> >> >>
>> >> >> Thus, I'm posting this to show that your work is greatly appreciated,
>> >> >> given the rather disappointing status quo of Linux's fairness when it
>> >> >> comes to disk IO time.
>> >> >>
>> >> >> I hope that your efforts lead to a change in performance of current
>> >> >> userland applications, the sooner, the better.
>> >> >>
>> >> > [Please don't remove people from original CC list. I am putting them back.]
>> >> >
>> >> > Hi Ulrich,
>> >> >
>> >> > I quickly went through that mail thread and I tried following on my
>> >> > desktop.
>> >> >
>> >> > ##########################################
>> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> >> > sleep 5
>> >> > time firefox
>> >> > # close firefox once gui pops up.
>> >> > ##########################################
>> >> >
>> >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got
>> >> > following.
>> >> >
>> >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>> >> >
>> >> > (Results do vary across runs, especially if system is booted fresh. Don't
>> >> > know why...).
>> >> >
>> >> >
>> >> > Then I tried putting both the applications in separate groups and assign
>> >> > them weights 200 each.
>> >> >
>> >> > ##########################################
>> >> > dd if=/home/vgoyal/4G-file of=/dev/null &
>> >> > echo $! > /cgroup/io/test1/tasks
>> >> > sleep 5
>> >> > echo $$ > /cgroup/io/test2/tasks
>> >> > time firefox
>> >> > # close firefox once gui pops up.
>> >> > ##########################################
>> >> >
>> >> > Now firefox pops up in 27 seconds. So it cut down the time by 2/3.
>> >> >
>> >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>> >> >
>> >> > Notice that throughput of dd also improved.
>> >> >
>> >> > I ran the block trace and noticed in many cases firefox threads
>> >> > immediately preempted the "dd". Probably because it was a file system
>> >> > request. So in this case latency will arise from seek time.
>> >> >
>> >> > In some other cases, threads had to wait for up to 100ms because dd was
>> >> > not preempted. In this case latency will arise both from waiting on queue
>> >> > as well as seek time.
>> >>
>> >> I think cfq should already be doing something similar, i.e. giving
>> >> 100ms slices to firefox, that alternate with dd, unless:
>> >> * firefox is too seeky (in this case, the idle window will be too small)
>> >> * firefox has too much think time.
>> >>
>> >
>> Hi Vivek,
>> > Hi Corrado,
>> >
>> > "firefox" is the shell script to setup the environment and launch the
>> > browser. It seems to be a group of threads. Some of them run in parallel
>> > and some of these seem to be running one after the other (once previous
>> > process or threads finished).
>>
>> Ok.
>>
>> >
>> >> To rule out the first case, what happens if you run the test with your
>> >> "fairness for seeky processes" patch?
>> >
>> > I applied that patch and it helps a lot.
>> >
>> > http://lwn.net/Articles/341032/
>> >
>> > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds.
>>
>> Great.
>> Can you try the attached patch (on top of 2.6.31)?
>> It implements the alternative approach we discussed privately in July,
>> and it addresses the possible latency increase that could happen with
>> your patch.
>>
>> To summarize for everyone, we separate sync sequential queues, sync
>> seeky queues and async queues in three separate RR structures, and
>> alternate servicing requests between them.
>>
>> When servicing seeky queues (the ones that are usually penalized by
>> cfq, for which no fairness is usually provided), we do not idle
>> between them, but we do idle for the last queue (the idle can be
>> exited when any seeky queue has requests). This allows us to allocate
>> disk time globally for all seeky processes, and to reduce seeky
>> processes latencies.
>>
>
> Ok, I seem to be doing same thing at group level (In group scheduling
> patches). I do not idle on individual sync seeky queues but if this is
> last queue in the group, then I do idle to make sure group does not lose
> its fair share and exit from idle the moment there is any busy queue in
> the group.
>
> So you seem to be grouping all the sync seeky queues system wide in a
> single group. So all the sync seeky queues collectively get 100ms in a
> single round of dispatch?

A round of dispatch (defined by the tunable target_latency, default 300ms)
is subdivided between the three groups, proportionally to how many queues
are waiting in each, so if we have 1 sequential and 2 seeky (and 0 async),
we get 100ms for seq and 200ms for seeky.

> I am wondering what happens if there are lot
> of such sync seeky queues this 100ms time slice is consumed before all the
> sync seeky queues got a chance to dispatch. Does that mean that some of
> the queues can completely skip the one dispatch round?

It can happen: if each seek costs 10ms, and you have more than 30 seeky
processes, then you are guaranteed that they cannot all issue in the same
round. When this happens, the ones that did not issue before will be the
first ones to issue in the next round.

Thanks,
Corrado

>
> Thanks
> Vivek
>
>> I tested with 'konsole -e exit', while doing a sequential write with
>> dd, and the start up time reduced from 37s to 7s, on an old laptop
>> disk.
>>
>> Thanks,
>> Corrado
>>
>> >
>> >> To rule out the first case, what happens if you run the test with your
>> >> "fairness for seeky processes" patch?
>> >
>> > I applied that patch and it helps a lot.
>> >
>> > http://lwn.net/Articles/341032/
>> >
>> > With above patchset applied, and fairness=1, firefox pops up in 27-28
>> > seconds.
>> >
>> > So it looks like if we don't disable idle window for seeky processes on
>> > hardware supporting command queuing, it helps in this particular case.
>> >
>> > Thanks
>> > Vivek
>> >
>
>
>

-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 349+ messages in thread
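
The round subdivision described in the message above can be sketched as follows. This is only a minimal illustration under stated assumptions, not code from the patch (which is not reproduced in this thread): the names queue_kind, round_split, split_round and seeks_per_slice are hypothetical, and the only facts taken from the message are the 300ms default for target_latency, the proportional split by number of waiting queues, and the 10ms-per-seek example.

```c
#include <assert.h>

/* Hypothetical names; the patch's real identifiers are not shown here. */
enum queue_kind { SYNC_SEQ, SYNC_SEEKY, ASYNC, NR_KINDS };

struct round_split {
    unsigned int slice_ms[NR_KINDS];
};

/*
 * Divide one dispatch round (target_latency_ms) between the three queue
 * types, in proportion to the number of queues waiting in each type.
 */
static struct round_split split_round(unsigned int target_latency_ms,
                                      const unsigned int waiting[NR_KINDS])
{
    struct round_split s = { { 0, 0, 0 } };
    unsigned int total = 0;
    int k;

    for (k = 0; k < NR_KINDS; k++)
        total += waiting[k];
    if (total == 0)
        return s;   /* nothing queued, nothing to divide */

    for (k = 0; k < NR_KINDS; k++)
        s.slice_ms[k] = target_latency_ms * waiting[k] / total;
    return s;
}

/* How many seeky requests fit in a slice if each seek costs seek_ms. */
static unsigned int seeks_per_slice(unsigned int slice_ms, unsigned int seek_ms)
{
    assert(seek_ms > 0);
    return slice_ms / seek_ms;
}
```

With waiting = {1, 2, 0} and a 300ms round, split_round() gives 100ms to the sequential queue and 200ms to the seeky queues, matching the example in the message. seeks_per_slice(300, 10) is 30, which is where the "more than 30 seeky processes" bound comes from when the whole round goes to seeky queues.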
* Re: IO scheduler based IO controller V10
  2009-09-28 15:35 ` Corrado Zoccolo
  2009-09-28 17:14 ` Vivek Goyal
@ 2009-09-28 17:51 ` Mike Galbraith
  2009-09-28 18:18 ` Vivek Goyal
  ` (2 more replies)
  [not found] ` <4e5e476b0909280835w3410d58aod93a29d1dcda8909-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 3 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-28 17:51 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe,
	Tobias Oetiker

On Mon, 2009-09-28 at 17:35 +0200, Corrado Zoccolo wrote:

> Great.
> Can you try the attached patch (on top of 2.6.31)?
> It implements the alternative approach we discussed privately in July,
> and it addresses the possible latency increase that could happen with
> your patch.
>
> To summarize for everyone, we separate sync sequential queues, sync
> seeky queues and async queues in three separate RR structures, and
> alternate servicing requests between them.
>
> When servicing seeky queues (the ones that are usually penalized by
> cfq, for which no fairness is usually provided), we do not idle
> between them, but we do idle for the last queue (the idle can be
> exited when any seeky queue has requests). This allows us to allocate
> disk time globally for all seeky processes, and to reduce seeky
> processes latencies.
>
> I tested with 'konsole -e exit', while doing a sequential write with
> dd, and the start up time reduced from 37s to 7s, on an old laptop
> disk.

I was fiddling around trying to get IDLE class to behave at least, and
getting a bit frustrated.  Class/priority didn't seem to make much if
any difference for konsole -e exit timings, and now I know why.  I saw
the reference to Vivek's patch, and gave it a shot.  Makes a large
difference.
                                                            Avg
perf stat    12.82     7.19     8.49     5.76     9.32      8.7     anticipatory
             16.24   175.82   154.38   228.97   147.16    144.5     noop
             43.23    57.39    96.13   148.25   180.09    105.0     deadline
              9.15    14.51     9.39    15.06     9.90     11.6     cfq fairness=0 dd=nice 0
             12.22     9.85    12.55     9.88    15.06     11.9     cfq fairness=0 dd=nice 19
              9.77    13.19    11.78    17.40     9.51     11.9     cfq fairness=0 dd=SCHED_IDLE
              4.59     2.74     4.70     3.45     4.69      4.0     cfq fairness=1 dd=nice 0
              3.79     4.66     2.66     5.15     3.03      3.8     cfq fairness=1 dd=nice 19
              2.79     4.73     2.79     4.02     2.50      3.3     cfq fairness=1 dd=SCHED_IDLE

I'll give your patch a spin as well.

	-Mike

^ permalink raw reply	[flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-09-28 17:51 ` Mike Galbraith
@ 2009-09-28 18:18 ` Vivek Goyal
  2009-09-29  5:55 ` Mike Galbraith
  [not found] ` <1254160274.9820.25.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-28 18:18 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Corrado Zoccolo, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe,
	Tobias Oetiker

On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:
> On Mon, 2009-09-28 at 17:35 +0200, Corrado Zoccolo wrote:
>
> > Great.
> > Can you try the attached patch (on top of 2.6.31)?
> > It implements the alternative approach we discussed privately in July,
> > and it addresses the possible latency increase that could happen with
> > your patch.
> >
> > To summarize for everyone, we separate sync sequential queues, sync
> > seeky queues and async queues in three separate RR structures, and
> > alternate servicing requests between them.
> >
> > When servicing seeky queues (the ones that are usually penalized by
> > cfq, for which no fairness is usually provided), we do not idle
> > between them, but we do idle for the last queue (the idle can be
> > exited when any seeky queue has requests). This allows us to allocate
> > disk time globally for all seeky processes, and to reduce seeky
> > processes latencies.
> >
> > I tested with 'konsole -e exit', while doing a sequential write with
> > dd, and the start up time reduced from 37s to 7s, on an old laptop
> > disk.
>
> I was fiddling around trying to get IDLE class to behave at least, and
> getting a bit frustrated.  Class/priority didn't seem to make much if
> any difference for konsole -e exit timings, and now I know why.

You seem to be testing konsole timings against a writer. In case of a
writer, prio will not make much of a difference, as prio only adjusts the
length of slice given to a process and writers rarely get to use their
slice length. Reader immediately preempts it...

I guess changing class to IDLE should have helped a bit as now this is
equivalent to setting the quantum to 1 and after dispatching one request
to disk, CFQ will always expire the writer once. So it might happen that
by the time the reader preempted the writer, we have fewer requests in
disk and lesser latency for this reader.

> I saw
> the reference to Vivek's patch, and gave it a shot.  Makes a large
> difference.
>                                                             Avg
> perf stat    12.82     7.19     8.49     5.76     9.32      8.7     anticipatory
>              16.24   175.82   154.38   228.97   147.16    144.5     noop
>              43.23    57.39    96.13   148.25   180.09    105.0     deadline
>               9.15    14.51     9.39    15.06     9.90     11.6     cfq fairness=0 dd=nice 0
>              12.22     9.85    12.55     9.88    15.06     11.9     cfq fairness=0 dd=nice 19
>               9.77    13.19    11.78    17.40     9.51     11.9     cfq fairness=0 dd=SCHED_IDLE
>               4.59     2.74     4.70     3.45     4.69      4.0     cfq fairness=1 dd=nice 0
>               3.79     4.66     2.66     5.15     3.03      3.8     cfq fairness=1 dd=nice 19
>               2.79     4.73     2.79     4.02     2.50      3.3     cfq fairness=1 dd=SCHED_IDLE
>

Hmm.., looks like average latency went down only in case of fairness=1
and not in case of fairness=0. (Looking at previous mail, average vanilla
cfq latencies were around 12 seconds).

Are you running all this in root group or have you put writers and readers
into separate cgroups?

If everything is running in root group, then I am curious why latency went
down in case of fairness=1. The only thing fairness=1 parameter does is
that it lets all the requests from the previous queue complete before it
starts dispatching from the next queue. On top of that, this is valid only
if no preemption took place. In your test case, konsole should preempt the
writer so practically fairness=1 might not make much difference.

In fact now Jens has committed a patch which achieves the similar effect as
fairness=1 for async queues.

commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
Author: Jens Axboe <jens.axboe@oracle.com>
Date:   Fri Jul 3 12:57:48 2009 +0200

    cfq-iosched: drain device queue before switching to a sync queue

    To lessen the impact of async IO on sync IO, let the device drain of
    any async IO in progress when switching to a sync cfqq that has idling
    enabled.


If everything is in separate cgroups, then we should have seen latency
improvements in case of fairness=0 case also. I am little perplexed here..

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread
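
The remark above that prio "only adjusts length of slice" can be made concrete with a small sketch. It is an illustration, not a quotation from cfq-iosched.c: the function names are hypothetical, and the constants (a scale factor of 5, eight best-effort levels, a 100ms base slice for sync queues, and the nice-to-ioprio mapping) are assumptions modeled on how CFQ of that era is generally described.

```c
#include <assert.h>

#define SLICE_SCALE   5   /* assumed scale factor */
#define BE_NR_LEVELS  8   /* best-effort ioprio levels 0..7 */

/* Assumed default mapping from CPU nice value (-20..19) to ioprio (0..7). */
static int nice_to_ioprio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return (nice + 20) / 5;
}

/*
 * Slice length for a queue: the base slice, stretched for high priority
 * (low ioprio number) and shrunk for low priority.
 */
static int prio_slice_ms(int base_slice_ms, int ioprio)
{
    assert(ioprio >= 0 && ioprio < BE_NR_LEVELS);
    return base_slice_ms + (base_slice_ms / SLICE_SCALE) * (4 - ioprio);
}
```

Under these assumptions, a nice 0 task maps to ioprio 4 and gets the plain 100ms slice, while a nice 19 task maps to ioprio 7 and gets 40ms. A writer that is preempted by the reader almost immediately never runs long enough for that difference to show, which would be consistent with the nearly identical nice 0 and nice 19 rows in the table.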
* Re: IO scheduler based IO controller V10
  2009-09-28 18:18 ` Vivek Goyal
  (?)
@ 2009-09-28 18:53 ` Mike Galbraith
  2009-09-29  7:14   ` Corrado Zoccolo
  [not found]        ` <1254164034.9820.81.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  -1 siblings, 2 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-28 18:53 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Corrado Zoccolo, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe,
	Tobias Oetiker

On Mon, 2009-09-28 at 14:18 -0400, Vivek Goyal wrote:
> On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:

> I guess changing class to IDLE should have helped a bit as now this is
> equivalent to setting the quantum to 1 and after dispatching one request
> to disk, CFQ will always expire the writer once. So it might happen that
> by the time the reader preempted writer, we have less number of requests in
> disk and lesser latency for this reader.

I expected SCHED_IDLE to be better than setting quantum to 1, because
max is quantum*4 if you aren't IDLE.  But that's not what happened.  I
just retested with all knobs set back to stock, fairness off, and
quantum set to 1 with everything running nice 0.  2.8 seconds avg :-/

> > I saw
> > the reference to Vivek's patch, and gave it a shot.  Makes a large
> > difference.
> >                                                          Avg
> > perf stat   12.82    7.19    8.49    5.76    9.32     8.7  anticipatory
> >             16.24  175.82  154.38  228.97  147.16   144.5  noop
> >             43.23   57.39   96.13  148.25  180.09   105.0  deadline
> >              9.15   14.51    9.39   15.06    9.90    11.6  cfq fairness=0 dd=nice 0
> >             12.22    9.85   12.55    9.88   15.06    11.9  cfq fairness=0 dd=nice 19
> >              9.77   13.19   11.78   17.40    9.51    11.9  cfq fairness=0 dd=SCHED_IDLE
> >              4.59    2.74    4.70    3.45    4.69     4.0  cfq fairness=1 dd=nice 0
> >              3.79    4.66    2.66    5.15    3.03     3.8  cfq fairness=1 dd=nice 19
> >              2.79    4.73    2.79    4.02    2.50     3.3  cfq fairness=1 dd=SCHED_IDLE
> >
>
> Hmm.., looks like average latency went down only in case of fairness=1
> and not in case of fairness=0. (Looking at previous mail, average vanilla
> cfq latencies were around 12 seconds).

Yup.

> Are you running all this in root group or have you put writers and readers
> into separate cgroups?

No cgroups here.

> If everything is running in root group, then I am curious why latency went
> down in case of fairness=1. The only thing fairness=1 parameter does is
> that it lets complete all the requests from previous queue before start
> dispatching from next queue. On top of this is valid only if no preemption
> took place. In your test case, konsole should preempt the writer so
> practically fairness=1 might not make much difference.

fairness=1 very definitely makes a very large difference.  All of those
cfq numbers were logged in back to back runs.

> In fact now Jens has committed a patch which achieves the similar effect as
> fairness=1 for async queues.

Yeah, I was there yesterday.  I speculated that that would hurt my
reader, but rearranging things didn't help one bit.  Playing with merge,
I managed to give dd ~7% more throughput, and injured poor reader even
more.  (problem analysis via hammer/axe not always most effective;)

> commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
> Author: Jens Axboe <jens.axboe@oracle.com>
> Date:   Fri Jul 3 12:57:48 2009 +0200
>
>     cfq-iosched: drain device queue before switching to a sync queue
>
>     To lessen the impact of async IO on sync IO, let the device drain of
>     any async IO in progress when switching to a sync cfqq that has idling
>     enabled.
>
>
> If everything is in separate cgroups, then we should have seen latency
> improvements in case of fairness=0 case also. I am little perplexed here..
>
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread
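The expectation Mike describes above (an idle-class queue behaves as if quantum were 1, while a non-idle queue can reach quantum*4) can be written down as a tiny model. This is a userspace sketch built only from the two statements in this exchange; `max_in_flight` and its parameters are hypothetical names, not kernel symbols, and the real dispatch path has further conditions that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper: upper bound on the requests a single queue may have
 * dispatched at once, according to the rule of thumb quoted in the thread.
 *   - idle class: always one request, regardless of quantum
 *   - otherwise:  at most four times the configured quantum
 */
unsigned max_in_flight(unsigned quantum, bool idle_class)
{
	assert(quantum > 0);	/* a quantum of 0 would dispatch nothing */

	if (idle_class)
		return 1;
	return 4 * quantum;
}
```

Under this model, quantum=1 with a non-idle writer still allows up to four requests in flight, whereas an idle-class writer is capped at one, which is why SCHED_IDLE was expected to do at least as well as quantum=1.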
* Re: IO scheduler based IO controller V10
  2009-09-28 18:53 ` Mike Galbraith
@ 2009-09-29  7:14   ` Corrado Zoccolo
  [not found]        ` <1254164034.9820.81.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Corrado Zoccolo @ 2009-09-29  7:14 UTC (permalink / raw)
  To: Mike Galbraith
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe,
	Tobias Oetiker

Hi Mike,
On Mon, Sep 28, 2009 at 8:53 PM, Mike Galbraith <efault@gmx.de> wrote:
> On Mon, 2009-09-28 at 14:18 -0400, Vivek Goyal wrote:
>> On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote:
>
>> I guess changing class to IDLE should have helped a bit as now this is
>> equivalent to setting the quantum to 1 and after dispatching one request
>> to disk, CFQ will always expire the writer once. So it might happen that
>> by the time the reader preempted writer, we have less number of requests in
>> disk and lesser latency for this reader.
>
> I expected SCHED_IDLE to be better than setting quantum to 1, because
> max is quantum*4 if you aren't IDLE.  But that's not what happened.  I
> just retested with all knobs set back to stock, fairness off, and
> quantum set to 1 with everything running nice 0.  2.8 seconds avg :-/

Idle doesn't work very well for async writes, since the writer process
will just send its writes to the page cache.
The real writeback will happen in the context of a kernel thread, with
best effort scheduling class.

>
>> > I saw
>> > the reference to Vivek's patch, and gave it a shot.  Makes a large
>> > difference.
>> >                                                          Avg
>> > perf stat   12.82    7.19    8.49    5.76    9.32     8.7  anticipatory
>> >             16.24  175.82  154.38  228.97  147.16   144.5  noop
>> >             43.23   57.39   96.13  148.25  180.09   105.0  deadline
>> >              9.15   14.51    9.39   15.06    9.90    11.6  cfq fairness=0 dd=nice 0
>> >             12.22    9.85   12.55    9.88   15.06    11.9  cfq fairness=0 dd=nice 19
>> >              9.77   13.19   11.78   17.40    9.51    11.9  cfq fairness=0 dd=SCHED_IDLE
>> >              4.59    2.74    4.70    3.45    4.69     4.0  cfq fairness=1 dd=nice 0
>> >              3.79    4.66    2.66    5.15    3.03     3.8  cfq fairness=1 dd=nice 19
>> >              2.79    4.73    2.79    4.02    2.50     3.3  cfq fairness=1 dd=SCHED_IDLE
>> >
>>
>> Hmm.., looks like average latency went down only in case of fairness=1
>> and not in case of fairness=0. (Looking at previous mail, average vanilla
>> cfq latencies were around 12 seconds).
>
> Yup.
>
>> Are you running all this in root group or have you put writers and readers
>> into separate cgroups?
>
> No cgroups here.
>
>> If everything is running in root group, then I am curious why latency went
>> down in case of fairness=1. The only thing fairness=1 parameter does is
>> that it lets complete all the requests from previous queue before start
>> dispatching from next queue. On top of this is valid only if no preemption
>> took place. In your test case, konsole should preempt the writer so
>> practically fairness=1 might not make much difference.
>
> fairness=1 very definitely makes a very large difference.  All of those
> cfq numbers were logged in back to back runs.
>
>> In fact now Jens has committed a patch which achieves the similar effect as
>> fairness=1 for async queues.
>
> Yeah, I was there yesterday.  I speculated that that would hurt my
> reader, but rearranging things didn't help one bit.  Playing with merge,
> I managed to give dd ~7% more throughput, and injured poor reader even
> more.  (problem analysis via hammer/axe not always most effective;)
>
>> commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
>> Author: Jens Axboe <jens.axboe@oracle.com>
>> Date:   Fri Jul 3 12:57:48 2009 +0200
>>
>>     cfq-iosched: drain device queue before switching to a sync queue
>>
>>     To lessen the impact of async IO on sync IO, let the device drain of
>>     any async IO in progress when switching to a sync cfqq that has idling
>>     enabled.
>>
>>
>> If everything is in separate cgroups, then we should have seen latency
>> improvements in case of fairness=0 case also. I am little perplexed here..
>>
>> Thanks
>> Vivek
>
>
Thanks,
Corrado

-- 
__________________________________________________________________________

dott. Corrado Zoccolo                          mailto:czoccolo@gmail.com
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------
The self-confidence of a warrior is not the self-confidence of the average
man. The average man seeks certainty in the eyes of the onlooker and calls
that self-confidence. The warrior seeks impeccability in his own eyes and
calls that humbleness.
                               Tales of Power - C. Castaneda

^ permalink raw reply	[flat|nested] 349+ messages in thread
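Corrado's point about the idle class can be made concrete with a sketch of how a process would request the idle I/O class for itself. This is an illustration under assumptions, not something taken from the thread: the constants are repeated from the kernel's ioprio definitions as I understand them (glibc ships no wrapper for this syscall), and `ioprio_value` and `set_self_io_idle` are hypothetical helper names. Note that Mike's test used the SCHED_IDLE CPU policy rather than this call.

```c
#define _GNU_SOURCE		/* for the syscall() declaration */
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumed to match the kernel's ioprio ABI; repeated here for illustration. */
#define IOPRIO_CLASS_SHIFT 13
enum { IOPRIO_CLASS_NONE, IOPRIO_CLASS_RT, IOPRIO_CLASS_BE, IOPRIO_CLASS_IDLE };
enum { IOPRIO_WHO_PROCESS = 1 };

/* Pack an I/O class and a per-class priority level into one ioprio value. */
int ioprio_value(int ioclass, int level)
{
	assert(ioclass >= IOPRIO_CLASS_NONE && ioclass <= IOPRIO_CLASS_IDLE);
	return (ioclass << IOPRIO_CLASS_SHIFT) | level;
}

/*
 * Ask the kernel to put the calling process (pid 0 means "self") in the
 * idle I/O class. Returns 0 on success, -1 with errno set on failure.
 */
int set_self_io_idle(void)
{
	return (int)syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
			    ioprio_value(IOPRIO_CLASS_IDLE, 0));
}
```

Even if such a call succeeds, the reply above explains why it does little for a buffered writer: the process only dirties the page cache, and the later writeback is issued from a kernel thread running in the best-effort class.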
* Re: IO scheduler based IO controller V10
  2009-09-28 17:51 ` Mike Galbraith
  2009-09-28 18:18   ` Vivek Goyal
@ 2009-09-29  5:55   ` Mike Galbraith
  [not found]        ` <1254160274.9820.25.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org>
  2 siblings, 0 replies; 349+ messages in thread
From: Mike Galbraith @ 2009-09-29  5:55 UTC (permalink / raw)
  To: Corrado Zoccolo
  Cc: Vivek Goyal, Ulrich Lukas, linux-kernel, containers, dm-devel,
	nauman, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov,
	fernando, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
	akpm, peterz, jmarchan, torvalds, mingo, riel, jens.axboe,
	Tobias Oetiker

On Mon, 2009-09-28 at 19:51 +0200, Mike Galbraith wrote:

> I'll give your patch a spin as well.

I applied it to tip, and fixed up rejects.  I haven't done a line for
line verification against the original patch yet (brave or..), so add
giant economy sized pinch of salt.

In the form it ended up in, it didn't help here.  I tried twiddling
knobs, but it didn't help either.  Reducing latency target from 300 to
30 did nada, but dropping to 3 did... I got to poke BRB.

Plugging Vivek's fairness tweakable on top, and enabling it, my timings
return to decent numbers, so that one liner absatively posilutely is
where my write vs read woes are coming from.

FWIW, below is patch wedged into tip v2.6.31-10215-ga3c9602

---
 block/cfq-iosched.c |  281 ++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 227 insertions(+), 54 deletions(-)

Index: linux-2.6/block/cfq-iosched.c
===================================================================
--- linux-2.6.orig/block/cfq-iosched.c
+++ linux-2.6/block/cfq-iosched.c
@@ -27,6 +27,12 @@ static const int cfq_slice_sync = HZ / 1
 static int cfq_slice_async = HZ / 25;
 static const int cfq_slice_async_rq = 2;
 static int cfq_slice_idle = HZ / 125;
+static int cfq_target_latency = HZ * 3/10; /* 300 ms */
+static int cfq_hist_divisor = 4;
+/*
+ * Number of times that other workloads can be scheduled before async
+ */
+static const unsigned int cfq_async_penalty = 4;

 /*
  * offset from end of service tree
@@ -36,7 +42,7 @@ static int cfq_slice_idle = HZ / 125;
 /*
  * below this threshold, we consider thinktime immediate
  */
-#define CFQ_MIN_TT		(2)
+#define CFQ_MIN_TT		(1)

 #define CFQ_SLICE_SCALE		(5)
 #define CFQ_HW_QUEUE_MIN	(5)
@@ -67,8 +73,9 @@ static DEFINE_SPINLOCK(ioc_gone_lock);
 struct cfq_rb_root {
 	struct rb_root rb;
 	struct rb_node *left;
+	unsigned count;
 };
-#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, }
+#define CFQ_RB_ROOT	(struct cfq_rb_root) { RB_ROOT, NULL, 0, }

 /*
  * Per process-grouping structure
@@ -113,6 +120,21 @@ struct cfq_queue {
 	unsigned short ioprio_class, org_ioprio_class;

 	pid_t pid;
+
+	struct cfq_rb_root *service_tree;
+	struct cfq_io_context *cic;
+};
+
+enum wl_prio_t {
+	IDLE_WL = -1,
+	BE_WL = 0,
+	RT_WL = 1
+};
+
+enum wl_type_t {
+	ASYNC_WL = 0,
+	SYNC_NOIDLE_WL = 1,
+	SYNC_WL = 2
 };

 /*
@@ -124,7 +146,13 @@ struct cfq_data {
 	/*
 	 * rr list of queues with requests and the count of them
 	 */
-	struct cfq_rb_root service_tree;
+	struct cfq_rb_root service_trees[2][3];
+	struct cfq_rb_root service_tree_idle;
+
+	enum wl_prio_t serving_prio;
+	enum wl_type_t serving_type;
+	unsigned long workload_expires;
+	unsigned int async_starved;

 	/*
 	 * Each priority tree is sorted by next_request position.  These
@@ -134,9 +162,11 @@ struct cfq_data {
 	struct rb_root prio_trees[CFQ_PRIO_LISTS];

 	unsigned int busy_queues;
+	unsigned int busy_queues_avg[2];

 	int rq_in_driver[2];
 	int sync_flight;
+	int reads_delayed;

 	/*
 	 * queue-depth detection
@@ -173,6 +203,9 @@ struct cfq_data {
 	unsigned int cfq_slice[2];
 	unsigned int cfq_slice_async_rq;
 	unsigned int cfq_slice_idle;
+	unsigned int cfq_target_latency;
+	unsigned int cfq_hist_divisor;
+	unsigned int cfq_async_penalty;

 	struct list_head cic_list;

@@ -182,6 +215,11 @@ struct cfq_data {
 	struct cfq_queue oom_cfqq;
 };

+static struct cfq_rb_root * service_tree_for(enum wl_prio_t prio, enum wl_type_t type,
+					      struct cfq_data *cfqd) {
+	return prio == IDLE_WL ? &cfqd->service_tree_idle : &cfqd->service_trees[prio][type];
+}
+
 enum cfqq_state_flags {
 	CFQ_CFQQ_FLAG_on_rr = 0,	/* on round-robin busy list */
 	CFQ_CFQQ_FLAG_wait_request,	/* waiting for a request */
@@ -226,6 +264,17 @@ CFQ_CFQQ_FNS(coop);
 #define cfq_log(cfqd, fmt, args...)	\
 	blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args)

+#define CIC_SEEK_THR	1024
+#define CIC_SEEKY(cic)	((cic)->seek_mean > CIC_SEEK_THR)
+#define CFQQ_SEEKY(cfqq) (!cfqq->cic || CIC_SEEKY(cfqq->cic))
+
+static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd) {
+	return wl==IDLE_WL? cfqd->service_tree_idle.count :
+		cfqd->service_trees[wl][ASYNC_WL].count
+		+ cfqd->service_trees[wl][SYNC_NOIDLE_WL].count
+		+ cfqd->service_trees[wl][SYNC_WL].count;
+}
+
 static void cfq_dispatch_insert(struct request_queue *, struct request *);
 static struct cfq_queue *cfq_get_queue(struct cfq_data *, int,
 				       struct io_context *, gfp_t);
@@ -247,6 +296,7 @@ static inline void cic_set_cfqq(struct c
 				struct cfq_queue *cfqq, int is_sync)
 {
 	cic->cfqq[!!is_sync] = cfqq;
+	cfqq->cic = cic;
 }

 /*
@@ -301,10 +351,33 @@ cfq_prio_to_slice(struct cfq_data *cfqd,
 	return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio);
 }

+static inline unsigned
+cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) {
+	unsigned min_q, max_q;
+	unsigned mult = cfqd->cfq_hist_divisor - 1;
+	unsigned round = cfqd->cfq_hist_divisor / 2;
+	unsigned busy = cfq_busy_queues_wl(rt, cfqd);
+	min_q = min(cfqd->busy_queues_avg[rt], busy);
+	max_q = max(cfqd->busy_queues_avg[rt], busy);
+	cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) /
+		cfqd->cfq_hist_divisor;
+	return cfqd->busy_queues_avg[rt];
+}
+
 static inline void
 cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
-	cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies;
+	unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1];
+	unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq));
+	unsigned slice = cfq_prio_to_slice(cfqd, cfqq);
+
+	if (iq > process_thr) {
+		unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle
+			/ cfqd->cfq_slice[1];
+		slice = max(slice * process_thr / iq, min(slice, low_slice));
+	}
+
+	cfqq->slice_end = jiffies + slice;
 	cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies);
 }

@@ -443,6 +516,7 @@ static void cfq_rb_erase(struct rb_node
 	if (root->left == n)
 		root->left = NULL;
 	rb_erase_init(n, &root->rb);
+	--root->count;
 }

 /*
@@ -483,46 +557,56 @@ static unsigned long cfq_slice_offset(st
 }

 /*
- * The cfqd->service_tree holds all pending cfq_queue's that have
+ * The cfqd->service_trees holds all pending cfq_queue's that have
  * requests waiting to be processed. It is sorted in the order that
  * we will service the queues.
  */
-static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq,
-				 int add_front)
+static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq)
 {
 	struct rb_node **p, *parent;
 	struct cfq_queue *__cfqq;
 	unsigned long rb_key;
+	struct cfq_rb_root *service_tree;
 	int left;

 	if (cfq_class_idle(cfqq)) {
 		rb_key = CFQ_IDLE_DELAY;
-		parent = rb_last(&cfqd->service_tree.rb);
+		service_tree = &cfqd->service_tree_idle;
+		parent = rb_last(&service_tree->rb);
 		if (parent && parent != &cfqq->rb_node) {
 			__cfqq = rb_entry(parent, struct cfq_queue, rb_node);
 			rb_key += __cfqq->rb_key;
 		} else
 			rb_key += jiffies;
-	} else if (!add_front) {
+	} else {
+		enum wl_prio_t prio = cfq_class_rt(cfqq) ? RT_WL : BE_WL;
+		enum wl_type_t type = cfq_cfqq_sync(cfqq) ? SYNC_WL : ASYNC_WL;
+
 		rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies;
 		rb_key += cfqq->slice_resid;
 		cfqq->slice_resid = 0;
-	} else
-		rb_key = 0;
+
+		if (type == SYNC_WL && (CFQQ_SEEKY(cfqq) || !cfq_cfqq_idle_window(cfqq)))
+			type = SYNC_NOIDLE_WL;
+
+		service_tree = service_tree_for(prio, type, cfqd);
+	}

 	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
 		/*
 		 * same position, nothing more to do
 		 */
-		if (rb_key == cfqq->rb_key)
+		if (rb_key == cfqq->rb_key && cfqq->service_tree == service_tree)
 			return;

-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+		cfqq->service_tree = NULL;
 	}

 	left = 1;
 	parent = NULL;
-	p = &cfqd->service_tree.rb.rb_node;
+	cfqq->service_tree = service_tree;
+	p = &service_tree->rb.rb_node;
 	while (*p) {
 		struct rb_node **n;

@@ -554,11 +638,12 @@ static void cfq_service_tree_add(struct
 	}

 	if (left)
-		cfqd->service_tree.left = &cfqq->rb_node;
+		service_tree->left = &cfqq->rb_node;

 	cfqq->rb_key = rb_key;
 	rb_link_node(&cfqq->rb_node, parent, p);
-	rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb);
+	rb_insert_color(&cfqq->rb_node, &service_tree->rb);
+	service_tree->count++;
 }

 static struct cfq_queue *
@@ -631,7 +716,7 @@ static void cfq_resort_rr_list(struct cf
 	 * Resorting requires the cfqq to be on the RR list already.
 	 */
 	if (cfq_cfqq_on_rr(cfqq)) {
-		cfq_service_tree_add(cfqd, cfqq, 0);
+		cfq_service_tree_add(cfqd, cfqq);
 		cfq_prio_tree_add(cfqd, cfqq);
 	}
 }
@@ -660,8 +745,10 @@ static void cfq_del_cfqq_rr(struct cfq_d
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));
 	cfq_clear_cfqq_on_rr(cfqq);

-	if (!RB_EMPTY_NODE(&cfqq->rb_node))
-		cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree);
+	if (!RB_EMPTY_NODE(&cfqq->rb_node)) {
+		cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree);
+		cfqq->service_tree = NULL;
+	}
 	if (cfqq->p_root) {
 		rb_erase(&cfqq->p_node, cfqq->p_root);
 		cfqq->p_root = NULL;
@@ -923,10 +1010,11 @@ static inline void cfq_slice_expired(str
  */
 static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd)
 {
-	if (RB_EMPTY_ROOT(&cfqd->service_tree.rb))
-		return NULL;
+	struct cfq_rb_root *service_tree = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd);

-	return cfq_rb_first(&cfqd->service_tree);
+	if (RB_EMPTY_ROOT(&service_tree->rb))
+		return NULL;
+	return cfq_rb_first(service_tree);
 }

 /*
@@ -954,9 +1042,6 @@ static inline sector_t cfq_dist_from_las
 		return cfqd->last_position - blk_rq_pos(rq);
 }

-#define CIC_SEEK_THR	8 * 1024
-#define CIC_SEEKY(cic)	((cic)->seek_mean > CIC_SEEK_THR)
-
 static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq)
 {
 	struct cfq_io_context *cic = cfqd->active_cic;
@@ -1044,6 +1129,10 @@ static struct cfq_queue *cfq_close_coope
 	if (cfq_cfqq_coop(cfqq))
 		return NULL;

+	/* we don't want to mix processes with different characteristics */
+	if (cfqq->service_tree != cur_cfqq->service_tree)
+		return NULL;
+
 	if (!probe)
 		cfq_mark_cfqq_coop(cfqq);
 	return cfqq;
@@ -1087,14 +1176,15 @@ static void cfq_arm_slice_timer(struct c

 	cfq_mark_cfqq_wait_request(cfqq);

-	/*
-	 * we don't want to idle for seeks, but we do want to allow
-	 * fair distribution of slice time for a process doing back-to-back
-	 * seeks. so allow a little bit of time for him to submit a new rq
-	 */
-	sl = cfqd->cfq_slice_idle;
-	if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic))
+	sl = min_t(unsigned, cfqd->cfq_slice_idle, cfqq->slice_end - jiffies);
+
+	/* very small idle if we are serving noidle trees, and there are more trees */
+	if (cfqd->serving_type == SYNC_NOIDLE_WL &&
+	    service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count > 0) {
+		if (blk_queue_nonrot(cfqd->queue))
+			return;
 		sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT));
+	}

 	mod_timer(&cfqd->idle_slice_timer, jiffies + sl);
 	cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl);
@@ -1110,6 +1200,11 @@ static void cfq_dispatch_insert(struct r

 	cfq_log_cfqq(cfqd, cfqq, "dispatch_insert");

+	if (!time_before(jiffies, rq->start_time + cfqd->cfq_target_latency / 2) && rq_data_dir(rq)==READ) {
+		cfqd->reads_delayed = max_t(int, cfqd->reads_delayed,
+					    (jiffies - rq->start_time) / (cfqd->cfq_target_latency / 2));
+	}
+
 	cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq);
 	cfq_remove_request(rq);
 	cfqq->dispatched++;
@@ -1156,6 +1251,16 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd,
 	return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio));
 }

+enum wl_type_t cfq_choose_sync_async(struct cfq_data *cfqd, enum wl_prio_t prio) {
+	struct cfq_queue *id, *ni;
+	ni = cfq_rb_first(service_tree_for(prio, SYNC_NOIDLE_WL, cfqd));
+	id = cfq_rb_first(service_tree_for(prio, SYNC_WL, cfqd));
+	if (id && ni && id->rb_key < ni->rb_key)
+		return SYNC_WL;
+	if (!ni) return SYNC_WL;
+	return SYNC_NOIDLE_WL;
+}
+
 /*
  * Select a queue for service. If we have a current active queue,
  * check whether to continue servicing it, or retrieve and set a new one.
@@ -1196,15 +1301,68 @@ static struct cfq_queue *cfq_select_queu
 	 * flight or is idling for a new request, allow either of these
 	 * conditions to happen (or time out) before selecting a new queue.
 	 */
-	if (timer_pending(&cfqd->idle_slice_timer) ||
+	if (timer_pending(&cfqd->idle_slice_timer) ||
 	    (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) {
 		cfqq = NULL;
 		goto keep_queue;
 	}
-
 expire:
 	cfq_slice_expired(cfqd, 0);
 new_queue:
+	if (!new_cfqq) {
+		enum wl_prio_t previous_prio = cfqd->serving_prio;
+
+		if (cfq_busy_queues_wl(RT_WL, cfqd))
+			cfqd->serving_prio = RT_WL;
+		else if (cfq_busy_queues_wl(BE_WL, cfqd))
+			cfqd->serving_prio = BE_WL;
+		else {
+			cfqd->serving_prio = IDLE_WL;
+			cfqd->workload_expires = jiffies + 1;
+			cfqd->reads_delayed = 0;
+		}
+
+		if (cfqd->serving_prio != IDLE_WL) {
+			int counts[]={
+				service_tree_for(cfqd->serving_prio, ASYNC_WL, cfqd)->count,
+				service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count,
+				service_tree_for(cfqd->serving_prio, SYNC_WL, cfqd)->count
+			};
+			int nonzero_counts= !!counts[0] + !!counts[1] + !!counts[2];
+
+			if (previous_prio != cfqd->serving_prio || (nonzero_counts == 1)) {
+				cfqd->serving_type = counts[1] ? SYNC_NOIDLE_WL : counts[2] ? SYNC_WL : ASYNC_WL;
+				cfqd->async_starved = 0;
+				cfqd->reads_delayed = 0;
+			} else {
+				if (!counts[cfqd->serving_type] || time_after(jiffies, cfqd->workload_expires)) {
+					if (cfqd->serving_type != ASYNC_WL && counts[ASYNC_WL] &&
+					    cfqd->async_starved++ > cfqd->cfq_async_penalty * (1 + cfqd->reads_delayed))
+						cfqd->serving_type = ASYNC_WL;
+					else
+						cfqd->serving_type = cfq_choose_sync_async(cfqd, cfqd->serving_prio);
+				} else
+					goto same_wl;
+			}
+
+			{
+				unsigned slice = cfqd->cfq_target_latency;
+				slice = slice * counts[cfqd->serving_type] /
+					max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio],
+					      counts[SYNC_WL] + counts[SYNC_NOIDLE_WL] + counts[ASYNC_WL]);
+
+				if (cfqd->serving_type == ASYNC_WL)
+					slice = max(1U, (slice / (1 + cfqd->reads_delayed))
+						    * cfqd->cfq_slice[0] / cfqd->cfq_slice[1]);
+				else
+					slice = max(slice, 2U * max(1U, cfqd->cfq_slice_idle));
+
+				cfqd->workload_expires = jiffies + slice;
+				cfqd->async_starved *= (cfqd->serving_type != ASYNC_WL);
+			}
+		}
+	}
+ same_wl:
 	cfqq = cfq_set_active_queue(cfqd, new_cfqq);
 keep_queue:
 	return cfqq;
@@ -1231,8 +1389,13 @@ static int cfq_forced_dispatch(struct cf
 {
 	struct cfq_queue *cfqq;
 	int dispatched = 0;
+	int i,j;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j])) != NULL)
+				dispatched += __cfq_forced_dispatch_cfqq(cfqq);

-	while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL)
+	while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL)
 		dispatched += __cfq_forced_dispatch_cfqq(cfqq);

 	cfq_slice_expired(cfqd, 0);
@@ -1300,6 +1463,12 @@ static int cfq_dispatch_requests(struct
 		return 0;

 	/*
+	 * Drain async requests before we start sync IO
+	 */
+	if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC])
+		return 0;
+
+	/*
 	 * If this is an async queue and we have sync IO in flight, let it wait
 	 */
 	if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq))
@@ -1993,18 +2162,8 @@ cfq_should_preempt(struct cfq_data *cfqd
 	if (cfq_class_idle(cfqq))
 		return 1;

-	/*
-	 * if the new request is sync, but the currently running queue is
-	 * not, let the sync request have priority.
-	 */
-	if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq))
-		return 1;
-
-	/*
-	 * So both queues are sync. Let the new request get disk time if
-	 * it's a metadata request and the current queue is doing regular IO.
-	 */
-	if (rq_is_meta(rq) && !cfqq->meta_pending)
+	if (cfqd->serving_type == SYNC_NOIDLE_WL
+	    && new_cfqq->service_tree == cfqq->service_tree)
 		return 1;

 	/*
@@ -2035,13 +2194,9 @@ static void cfq_preempt_queue(struct cfq
 	cfq_log_cfqq(cfqd, cfqq, "preempt");
 	cfq_slice_expired(cfqd, 1);

-	/*
-	 * Put the new queue at the front of the of the current list,
-	 * so we know that it will be selected next.
-	 */
 	BUG_ON(!cfq_cfqq_on_rr(cfqq));

-	cfq_service_tree_add(cfqd, cfqq, 1);
+	cfq_service_tree_add(cfqd, cfqq);

 	cfqq->slice_end = 0;
 	cfq_mark_cfqq_slice_new(cfqq);
@@ -2438,13 +2593,16 @@ static void cfq_exit_queue(struct elevat
 static void *cfq_init_queue(struct request_queue *q)
 {
 	struct cfq_data *cfqd;
-	int i;
+	int i,j;

 	cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node);
 	if (!cfqd)
 		return NULL;

-	cfqd->service_tree = CFQ_RB_ROOT;
+	for (i = 0; i < 2; ++i)
+		for (j = 0; j < 3; ++j)
+			cfqd->service_trees[i][j] = CFQ_RB_ROOT;
+	cfqd->service_tree_idle = CFQ_RB_ROOT;

 	/*
 	 * Not strictly needed (since RB_ROOT just clears the node and we
@@ -2481,6 +2639,9 @@ static void *cfq_init_queue(struct reque
 	cfqd->cfq_slice[1] = cfq_slice_sync;
 	cfqd->cfq_slice_async_rq = cfq_slice_async_rq;
 	cfqd->cfq_slice_idle = cfq_slice_idle;
+	cfqd->cfq_target_latency = cfq_target_latency;
+	cfqd->cfq_hist_divisor = cfq_hist_divisor;
+	cfqd->cfq_async_penalty = cfq_async_penalty;
 	cfqd->hw_tag = 1;

 	return cfqd;
@@ -2517,6 +2678,7 @@ fail:
 /*
  * sysfs parts below -->
  */
+
 static ssize_t
 cfq_var_show(unsigned int var, char *page)
 {
@@ -2550,6 +2712,9 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd-
 SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1);
 SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1);
 SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0);
+SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1);
+SHOW_FUNCTION(cfq_hist_divisor_show, cfqd->cfq_hist_divisor, 0);
+SHOW_FUNCTION(cfq_async_penalty_show, cfqd->cfq_async_penalty, 0);
 #undef SHOW_FUNCTION

 #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV)			\
@@ -2581,6 +2746,11 @@ STORE_FUNCTION(cfq_slice_sync_store, &cf
 STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1);
 STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1,
 		UINT_MAX, 0);
+
+STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 1000, 1);
+STORE_FUNCTION(cfq_hist_divisor_store, &cfqd->cfq_hist_divisor, 1, 100, 0);
+STORE_FUNCTION(cfq_async_penalty_store, &cfqd->cfq_async_penalty, 1, UINT_MAX, 0);
+
 #undef STORE_FUNCTION

 #define CFQ_ATTR(name) \
@@ -2596,6 +2766,9 @@ static struct elv_fs_entry cfq_attrs[] =
 	CFQ_ATTR(slice_async),
 	CFQ_ATTR(slice_async_rq),
 	CFQ_ATTR(slice_idle),
+	CFQ_ATTR(target_latency),
+	CFQ_ATTR(hist_divisor),
+	CFQ_ATTR(async_penalty),
 	__ATTR_NULL
 };


^ permalink raw reply	[flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <1254160274.9820.25.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> @ 2009-09-28 18:18 ` Vivek Goyal 2009-09-29 5:55 ` Mike Galbraith 1 sibling, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-28 18:18 UTC (permalink / raw) To: Mike Galbraith Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Corrado Zoccolo, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Mon, Sep 28, 2009 at 07:51:14PM +0200, Mike Galbraith wrote: > On Mon, 2009-09-28 at 17:35 +0200, Corrado Zoccolo wrote: > > > Great. > > Can you try the attached patch (on top of 2.6.31)? > > It implements the alternative approach we discussed privately in july, > > and it addresses the possible latency increase that could happen with > > your patch. > > > > To summarize for everyone, we separate sync sequential queues, sync > > seeky queues and async queues in three separate RR strucutres, and > > alternate servicing requests between them. > > > > When servicing seeky queues (the ones that are usually penalized by > > cfq, for which no fairness is usually provided), we do not idle > > between them, but we do idle for the last queue (the idle can be > > exited when any seeky queue has requests). This allows us to allocate > > disk time globally for all seeky processes, and to reduce seeky > > processes latencies. 
> >
> > I tested with 'konsole -e exit', while doing a sequential write with
> > dd, and the start up time reduced from 37s to 7s, on an old laptop
> > disk.
>
> I was fiddling around trying to get IDLE class to behave at least, and
> getting a bit frustrated. Class/priority didn't seem to make much if
> any difference for konsole -e exit timings, and now I know why.

You seem to be testing konsole timings against a writer. In the case of a
writer, prio will not make much of a difference, as prio only adjusts the
length of the slice given to a process, and writers rarely get to use their
full slice. The reader immediately preempts it...

I guess changing the class to IDLE should have helped a bit, as this is now
equivalent to setting the quantum to 1: after dispatching one request to
disk, CFQ will always expire the writer. So it might happen that by the
time the reader preempts the writer, there are fewer requests in the disk
queue and hence lower latency for this reader.

> I saw
> the reference to Vivek's patch, and gave it a shot. Makes a large
> difference.
>                                                        Avg
> perf stat    12.82    7.19    8.49    5.76    9.32     8.7   anticipatory
>              16.24  175.82  154.38  228.97  147.16   144.5   noop
>              43.23   57.39   96.13  148.25  180.09   105.0   deadline
>               9.15   14.51    9.39   15.06    9.90    11.6   cfq fairness=0 dd=nice 0
>              12.22    9.85   12.55    9.88   15.06    11.9   cfq fairness=0 dd=nice 19
>               9.77   13.19   11.78   17.40    9.51    11.9   cfq fairness=0 dd=SCHED_IDLE
>               4.59    2.74    4.70    3.45    4.69     4.0   cfq fairness=1 dd=nice 0
>               3.79    4.66    2.66    5.15    3.03     3.8   cfq fairness=1 dd=nice 19
>               2.79    4.73    2.79    4.02    2.50     3.3   cfq fairness=1 dd=SCHED_IDLE
>

Hmm.., looks like average latency went down only in the fairness=1 case and
not in the fairness=0 case. (Looking at the previous mail, average vanilla
cfq latencies were around 12 seconds.)

Are you running all this in the root group, or have you put writers and
readers into separate cgroups?

If everything is running in the root group, then I am curious why latency
went down in the fairness=1 case.
The only thing the fairness=1 parameter does is let all the requests from
the previous queue complete before dispatching starts from the next queue.
On top of that, it applies only if no preemption took place. In your test
case, konsole should preempt the writer, so in practice fairness=1 might
not make much difference.

In fact, Jens has now committed a patch which achieves a similar effect to
fairness=1 for async queues.

commit 5ad531db6e0f3c3c985666e83d3c1c4d53acccf9
Author: Jens Axboe <jens.axboe-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Date:   Fri Jul 3 12:57:48 2009 +0200

    cfq-iosched: drain device queue before switching to a sync queue

    To lessen the impact of async IO on sync IO, let the device drain of
    any async IO in progress when switching to a sync cfqq that has idling
    enabled.

If everything is in separate cgroups, then we should have seen latency
improvements in the fairness=0 case also. I am a little perplexed here..

Thanks
Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <1254160274.9820.25.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-09-28 18:18 ` Vivek Goyal @ 2009-09-29 5:55 ` Mike Galbraith 1 sibling, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-09-29 5:55 UTC (permalink / raw) To: Corrado Zoccolo Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Mon, 2009-09-28 at 19:51 +0200, Mike Galbraith wrote: > I'll give your patch a spin as well. I applied it to tip, and fixed up rejects. I haven't done a line for line verification against the original patch yet (brave or..), so add giant economy sized pinch of salt. In the form it ended up in, it didn't help here. I tried twiddling knobs, but it didn't help either. Reducing latency target from 300 to 30 did nada, but dropping to 3 did... I got to poke BRB. Plugging Vivek's fairness tweakable on top, and enabling it, my timings return to decent numbers, so that one liner absatively posilutely is where my write vs read woes are coming from. 
FWIW, below is patch wedged into tip v2.6.31-10215-ga3c9602 --- block/cfq-iosched.c | 281 ++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 227 insertions(+), 54 deletions(-) Index: linux-2.6/block/cfq-iosched.c =================================================================== --- linux-2.6.orig/block/cfq-iosched.c +++ linux-2.6/block/cfq-iosched.c @@ -27,6 +27,12 @@ static const int cfq_slice_sync = HZ / 1 static int cfq_slice_async = HZ / 25; static const int cfq_slice_async_rq = 2; static int cfq_slice_idle = HZ / 125; +static int cfq_target_latency = HZ * 3/10; /* 300 ms */ +static int cfq_hist_divisor = 4; +/* + * Number of times that other workloads can be scheduled before async + */ +static const unsigned int cfq_async_penalty = 4; /* * offset from end of service tree @@ -36,7 +42,7 @@ static int cfq_slice_idle = HZ / 125; /* * below this threshold, we consider thinktime immediate */ -#define CFQ_MIN_TT (2) +#define CFQ_MIN_TT (1) #define CFQ_SLICE_SCALE (5) #define CFQ_HW_QUEUE_MIN (5) @@ -67,8 +73,9 @@ static DEFINE_SPINLOCK(ioc_gone_lock); struct cfq_rb_root { struct rb_root rb; struct rb_node *left; + unsigned count; }; -#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, } +#define CFQ_RB_ROOT (struct cfq_rb_root) { RB_ROOT, NULL, 0, } /* * Per process-grouping structure @@ -113,6 +120,21 @@ struct cfq_queue { unsigned short ioprio_class, org_ioprio_class; pid_t pid; + + struct cfq_rb_root *service_tree; + struct cfq_io_context *cic; +}; + +enum wl_prio_t { + IDLE_WL = -1, + BE_WL = 0, + RT_WL = 1 +}; + +enum wl_type_t { + ASYNC_WL = 0, + SYNC_NOIDLE_WL = 1, + SYNC_WL = 2 }; /* @@ -124,7 +146,13 @@ struct cfq_data { /* * rr list of queues with requests and the count of them */ - struct cfq_rb_root service_tree; + struct cfq_rb_root service_trees[2][3]; + struct cfq_rb_root service_tree_idle; + + enum wl_prio_t serving_prio; + enum wl_type_t serving_type; + unsigned long workload_expires; + unsigned int async_starved; /* * Each 
priority tree is sorted by next_request position. These @@ -134,9 +162,11 @@ struct cfq_data { struct rb_root prio_trees[CFQ_PRIO_LISTS]; unsigned int busy_queues; + unsigned int busy_queues_avg[2]; int rq_in_driver[2]; int sync_flight; + int reads_delayed; /* * queue-depth detection @@ -173,6 +203,9 @@ struct cfq_data { unsigned int cfq_slice[2]; unsigned int cfq_slice_async_rq; unsigned int cfq_slice_idle; + unsigned int cfq_target_latency; + unsigned int cfq_hist_divisor; + unsigned int cfq_async_penalty; struct list_head cic_list; @@ -182,6 +215,11 @@ struct cfq_data { struct cfq_queue oom_cfqq; }; +static struct cfq_rb_root * service_tree_for(enum wl_prio_t prio, enum wl_type_t type, + struct cfq_data *cfqd) { + return prio == IDLE_WL ? &cfqd->service_tree_idle : &cfqd->service_trees[prio][type]; +} + enum cfqq_state_flags { CFQ_CFQQ_FLAG_on_rr = 0, /* on round-robin busy list */ CFQ_CFQQ_FLAG_wait_request, /* waiting for a request */ @@ -226,6 +264,17 @@ CFQ_CFQQ_FNS(coop); #define cfq_log(cfqd, fmt, args...) \ blk_add_trace_msg((cfqd)->queue, "cfq " fmt, ##args) +#define CIC_SEEK_THR 1024 +#define CIC_SEEKY(cic) ((cic)->seek_mean > CIC_SEEK_THR) +#define CFQQ_SEEKY(cfqq) (!cfqq->cic || CIC_SEEKY(cfqq->cic)) + +static inline int cfq_busy_queues_wl(enum wl_prio_t wl, struct cfq_data *cfqd) { + return wl==IDLE_WL? 
cfqd->service_tree_idle.count : + cfqd->service_trees[wl][ASYNC_WL].count + + cfqd->service_trees[wl][SYNC_NOIDLE_WL].count + + cfqd->service_trees[wl][SYNC_WL].count; +} + static void cfq_dispatch_insert(struct request_queue *, struct request *); static struct cfq_queue *cfq_get_queue(struct cfq_data *, int, struct io_context *, gfp_t); @@ -247,6 +296,7 @@ static inline void cic_set_cfqq(struct c struct cfq_queue *cfqq, int is_sync) { cic->cfqq[!!is_sync] = cfqq; + cfqq->cic = cic; } /* @@ -301,10 +351,33 @@ cfq_prio_to_slice(struct cfq_data *cfqd, return cfq_prio_slice(cfqd, cfq_cfqq_sync(cfqq), cfqq->ioprio); } +static inline unsigned +cfq_get_interested_queues(struct cfq_data *cfqd, bool rt) { + unsigned min_q, max_q; + unsigned mult = cfqd->cfq_hist_divisor - 1; + unsigned round = cfqd->cfq_hist_divisor / 2; + unsigned busy = cfq_busy_queues_wl(rt, cfqd); + min_q = min(cfqd->busy_queues_avg[rt], busy); + max_q = max(cfqd->busy_queues_avg[rt], busy); + cfqd->busy_queues_avg[rt] = (mult * max_q + min_q + round) / + cfqd->cfq_hist_divisor; + return cfqd->busy_queues_avg[rt]; +} + static inline void cfq_set_prio_slice(struct cfq_data *cfqd, struct cfq_queue *cfqq) { - cfqq->slice_end = cfq_prio_to_slice(cfqd, cfqq) + jiffies; + unsigned process_thr = cfqd->cfq_target_latency / cfqd->cfq_slice[1]; + unsigned iq = cfq_get_interested_queues(cfqd, cfq_class_rt(cfqq)); + unsigned slice = cfq_prio_to_slice(cfqd, cfqq); + + if (iq > process_thr) { + unsigned low_slice = 2 * slice * cfqd->cfq_slice_idle + / cfqd->cfq_slice[1]; + slice = max(slice * process_thr / iq, min(slice, low_slice)); + } + + cfqq->slice_end = jiffies + slice; cfq_log_cfqq(cfqd, cfqq, "set_slice=%lu", cfqq->slice_end - jiffies); } @@ -443,6 +516,7 @@ static void cfq_rb_erase(struct rb_node if (root->left == n) root->left = NULL; rb_erase_init(n, &root->rb); + --root->count; } /* @@ -483,46 +557,56 @@ static unsigned long cfq_slice_offset(st } /* - * The cfqd->service_tree holds all pending 
cfq_queue's that have + * The cfqd->service_trees holds all pending cfq_queue's that have * requests waiting to be processed. It is sorted in the order that * we will service the queues. */ -static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq, - int add_front) +static void cfq_service_tree_add(struct cfq_data *cfqd, struct cfq_queue *cfqq) { struct rb_node **p, *parent; struct cfq_queue *__cfqq; unsigned long rb_key; + struct cfq_rb_root *service_tree; int left; if (cfq_class_idle(cfqq)) { rb_key = CFQ_IDLE_DELAY; - parent = rb_last(&cfqd->service_tree.rb); + service_tree = &cfqd->service_tree_idle; + parent = rb_last(&service_tree->rb); if (parent && parent != &cfqq->rb_node) { __cfqq = rb_entry(parent, struct cfq_queue, rb_node); rb_key += __cfqq->rb_key; } else rb_key += jiffies; - } else if (!add_front) { + } else { + enum wl_prio_t prio = cfq_class_rt(cfqq) ? RT_WL : BE_WL; + enum wl_type_t type = cfq_cfqq_sync(cfqq) ? SYNC_WL : ASYNC_WL; + rb_key = cfq_slice_offset(cfqd, cfqq) + jiffies; rb_key += cfqq->slice_resid; cfqq->slice_resid = 0; - } else - rb_key = 0; + + if (type == SYNC_WL && (CFQQ_SEEKY(cfqq) || !cfq_cfqq_idle_window(cfqq))) + type = SYNC_NOIDLE_WL; + + service_tree = service_tree_for(prio, type, cfqd); + } if (!RB_EMPTY_NODE(&cfqq->rb_node)) { /* * same position, nothing more to do */ - if (rb_key == cfqq->rb_key) + if (rb_key == cfqq->rb_key && cfqq->service_tree == service_tree) return; - cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree); + cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree); + cfqq->service_tree = NULL; } left = 1; parent = NULL; - p = &cfqd->service_tree.rb.rb_node; + cfqq->service_tree = service_tree; + p = &service_tree->rb.rb_node; while (*p) { struct rb_node **n; @@ -554,11 +638,12 @@ static void cfq_service_tree_add(struct } if (left) - cfqd->service_tree.left = &cfqq->rb_node; + service_tree->left = &cfqq->rb_node; cfqq->rb_key = rb_key; rb_link_node(&cfqq->rb_node, parent, p); - 
rb_insert_color(&cfqq->rb_node, &cfqd->service_tree.rb); + rb_insert_color(&cfqq->rb_node, &service_tree->rb); + service_tree->count++; } static struct cfq_queue * @@ -631,7 +716,7 @@ static void cfq_resort_rr_list(struct cf * Resorting requires the cfqq to be on the RR list already. */ if (cfq_cfqq_on_rr(cfqq)) { - cfq_service_tree_add(cfqd, cfqq, 0); + cfq_service_tree_add(cfqd, cfqq); cfq_prio_tree_add(cfqd, cfqq); } } @@ -660,8 +745,10 @@ static void cfq_del_cfqq_rr(struct cfq_d BUG_ON(!cfq_cfqq_on_rr(cfqq)); cfq_clear_cfqq_on_rr(cfqq); - if (!RB_EMPTY_NODE(&cfqq->rb_node)) - cfq_rb_erase(&cfqq->rb_node, &cfqd->service_tree); + if (!RB_EMPTY_NODE(&cfqq->rb_node)) { + cfq_rb_erase(&cfqq->rb_node, cfqq->service_tree); + cfqq->service_tree = NULL; + } if (cfqq->p_root) { rb_erase(&cfqq->p_node, cfqq->p_root); cfqq->p_root = NULL; @@ -923,10 +1010,11 @@ static inline void cfq_slice_expired(str */ static struct cfq_queue *cfq_get_next_queue(struct cfq_data *cfqd) { - if (RB_EMPTY_ROOT(&cfqd->service_tree.rb)) - return NULL; + struct cfq_rb_root *service_tree = service_tree_for(cfqd->serving_prio, cfqd->serving_type, cfqd); - return cfq_rb_first(&cfqd->service_tree); + if (RB_EMPTY_ROOT(&service_tree->rb)) + return NULL; + return cfq_rb_first(service_tree); } /* @@ -954,9 +1042,6 @@ static inline sector_t cfq_dist_from_las return cfqd->last_position - blk_rq_pos(rq); } -#define CIC_SEEK_THR 8 * 1024 -#define CIC_SEEKY(cic) ((cic)->seek_mean > CIC_SEEK_THR) - static inline int cfq_rq_close(struct cfq_data *cfqd, struct request *rq) { struct cfq_io_context *cic = cfqd->active_cic; @@ -1044,6 +1129,10 @@ static struct cfq_queue *cfq_close_coope if (cfq_cfqq_coop(cfqq)) return NULL; + /* we don't want to mix processes with different characteristics */ + if (cfqq->service_tree != cur_cfqq->service_tree) + return NULL; + if (!probe) cfq_mark_cfqq_coop(cfqq); return cfqq; @@ -1087,14 +1176,15 @@ static void cfq_arm_slice_timer(struct c cfq_mark_cfqq_wait_request(cfqq); - /* 
- * we don't want to idle for seeks, but we do want to allow - * fair distribution of slice time for a process doing back-to-back - * seeks. so allow a little bit of time for him to submit a new rq - */ - sl = cfqd->cfq_slice_idle; - if (sample_valid(cic->seek_samples) && CIC_SEEKY(cic)) + sl = min_t(unsigned, cfqd->cfq_slice_idle, cfqq->slice_end - jiffies); + + /* very small idle if we are serving noidle trees, and there are more trees */ + if (cfqd->serving_type == SYNC_NOIDLE_WL && + service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count > 0) { + if (blk_queue_nonrot(cfqd->queue)) + return; sl = min(sl, msecs_to_jiffies(CFQ_MIN_TT)); + } mod_timer(&cfqd->idle_slice_timer, jiffies + sl); cfq_log_cfqq(cfqd, cfqq, "arm_idle: %lu", sl); @@ -1110,6 +1200,11 @@ static void cfq_dispatch_insert(struct r cfq_log_cfqq(cfqd, cfqq, "dispatch_insert"); + if (!time_before(jiffies, rq->start_time + cfqd->cfq_target_latency / 2) && rq_data_dir(rq)==READ) { + cfqd->reads_delayed = max_t(int, cfqd->reads_delayed, + (jiffies - rq->start_time) / (cfqd->cfq_target_latency / 2)); + } + cfqq->next_rq = cfq_find_next_rq(cfqd, cfqq, rq); cfq_remove_request(rq); cfqq->dispatched++; @@ -1156,6 +1251,16 @@ cfq_prio_to_maxrq(struct cfq_data *cfqd, return 2 * (base_rq + base_rq * (CFQ_PRIO_LISTS - 1 - cfqq->ioprio)); } +enum wl_type_t cfq_choose_sync_async(struct cfq_data *cfqd, enum wl_prio_t prio) { + struct cfq_queue *id, *ni; + ni = cfq_rb_first(service_tree_for(prio, SYNC_NOIDLE_WL, cfqd)); + id = cfq_rb_first(service_tree_for(prio, SYNC_WL, cfqd)); + if (id && ni && id->rb_key < ni->rb_key) + return SYNC_WL; + if (!ni) return SYNC_WL; + return SYNC_NOIDLE_WL; +} + /* * Select a queue for service. If we have a current active queue, * check whether to continue servicing it, or retrieve and set a new one. 
@@ -1196,15 +1301,68 @@ static struct cfq_queue *cfq_select_queu * flight or is idling for a new request, allow either of these * conditions to happen (or time out) before selecting a new queue. */ - if (timer_pending(&cfqd->idle_slice_timer) || + if (timer_pending(&cfqd->idle_slice_timer) || (cfqq->dispatched && cfq_cfqq_idle_window(cfqq))) { cfqq = NULL; goto keep_queue; } - expire: cfq_slice_expired(cfqd, 0); new_queue: + if (!new_cfqq) { + enum wl_prio_t previous_prio = cfqd->serving_prio; + + if (cfq_busy_queues_wl(RT_WL, cfqd)) + cfqd->serving_prio = RT_WL; + else if (cfq_busy_queues_wl(BE_WL, cfqd)) + cfqd->serving_prio = BE_WL; + else { + cfqd->serving_prio = IDLE_WL; + cfqd->workload_expires = jiffies + 1; + cfqd->reads_delayed = 0; + } + + if (cfqd->serving_prio != IDLE_WL) { + int counts[]={ + service_tree_for(cfqd->serving_prio, ASYNC_WL, cfqd)->count, + service_tree_for(cfqd->serving_prio, SYNC_NOIDLE_WL, cfqd)->count, + service_tree_for(cfqd->serving_prio, SYNC_WL, cfqd)->count + }; + int nonzero_counts= !!counts[0] + !!counts[1] + !!counts[2]; + + if (previous_prio != cfqd->serving_prio || (nonzero_counts == 1)) { + cfqd->serving_type = counts[1] ? SYNC_NOIDLE_WL : counts[2] ? 
SYNC_WL : ASYNC_WL; + cfqd->async_starved = 0; + cfqd->reads_delayed = 0; + } else { + if (!counts[cfqd->serving_type] || time_after(jiffies, cfqd->workload_expires)) { + if (cfqd->serving_type != ASYNC_WL && counts[ASYNC_WL] && + cfqd->async_starved++ > cfqd->cfq_async_penalty * (1 + cfqd->reads_delayed)) + cfqd->serving_type = ASYNC_WL; + else + cfqd->serving_type = cfq_choose_sync_async(cfqd, cfqd->serving_prio); + } else + goto same_wl; + } + + { + unsigned slice = cfqd->cfq_target_latency; + slice = slice * counts[cfqd->serving_type] / + max_t(unsigned, cfqd->busy_queues_avg[cfqd->serving_prio], + counts[SYNC_WL] + counts[SYNC_NOIDLE_WL] + counts[ASYNC_WL]); + + if (cfqd->serving_type == ASYNC_WL) + slice = max(1U, (slice / (1 + cfqd->reads_delayed)) + * cfqd->cfq_slice[0] / cfqd->cfq_slice[1]); + else + slice = max(slice, 2U * max(1U, cfqd->cfq_slice_idle)); + + cfqd->workload_expires = jiffies + slice; + cfqd->async_starved *= (cfqd->serving_type != ASYNC_WL); + } + } + } + same_wl: cfqq = cfq_set_active_queue(cfqd, new_cfqq); keep_queue: return cfqq; @@ -1231,8 +1389,13 @@ static int cfq_forced_dispatch(struct cf { struct cfq_queue *cfqq; int dispatched = 0; + int i,j; + for (i = 0; i < 2; ++i) + for (j = 0; j < 3; ++j) + while ((cfqq = cfq_rb_first(&cfqd->service_trees[i][j])) != NULL) + dispatched += __cfq_forced_dispatch_cfqq(cfqq); - while ((cfqq = cfq_rb_first(&cfqd->service_tree)) != NULL) + while ((cfqq = cfq_rb_first(&cfqd->service_tree_idle)) != NULL) dispatched += __cfq_forced_dispatch_cfqq(cfqq); cfq_slice_expired(cfqd, 0); @@ -1300,6 +1463,12 @@ static int cfq_dispatch_requests(struct return 0; /* + * Drain async requests before we start sync IO + */ + if (cfq_cfqq_idle_window(cfqq) && cfqd->rq_in_driver[BLK_RW_ASYNC]) + return 0; + + /* * If this is an async queue and we have sync IO in flight, let it wait */ if (cfqd->sync_flight && !cfq_cfqq_sync(cfqq)) @@ -1993,18 +2162,8 @@ cfq_should_preempt(struct cfq_data *cfqd if (cfq_class_idle(cfqq)) 
return 1; - /* - * if the new request is sync, but the currently running queue is - * not, let the sync request have priority. - */ - if (rq_is_sync(rq) && !cfq_cfqq_sync(cfqq)) - return 1; - - /* - * So both queues are sync. Let the new request get disk time if - * it's a metadata request and the current queue is doing regular IO. - */ - if (rq_is_meta(rq) && !cfqq->meta_pending) + if (cfqd->serving_type == SYNC_NOIDLE_WL + && new_cfqq->service_tree == cfqq->service_tree) return 1; /* @@ -2035,13 +2194,9 @@ static void cfq_preempt_queue(struct cfq cfq_log_cfqq(cfqd, cfqq, "preempt"); cfq_slice_expired(cfqd, 1); - /* - * Put the new queue at the front of the of the current list, - * so we know that it will be selected next. - */ BUG_ON(!cfq_cfqq_on_rr(cfqq)); - cfq_service_tree_add(cfqd, cfqq, 1); + cfq_service_tree_add(cfqd, cfqq); cfqq->slice_end = 0; cfq_mark_cfqq_slice_new(cfqq); @@ -2438,13 +2593,16 @@ static void cfq_exit_queue(struct elevat static void *cfq_init_queue(struct request_queue *q) { struct cfq_data *cfqd; - int i; + int i,j; cfqd = kmalloc_node(sizeof(*cfqd), GFP_KERNEL | __GFP_ZERO, q->node); if (!cfqd) return NULL; - cfqd->service_tree = CFQ_RB_ROOT; + for (i = 0; i < 2; ++i) + for (j = 0; j < 3; ++j) + cfqd->service_trees[i][j] = CFQ_RB_ROOT; + cfqd->service_tree_idle = CFQ_RB_ROOT; /* * Not strictly needed (since RB_ROOT just clears the node and we @@ -2481,6 +2639,9 @@ static void *cfq_init_queue(struct reque cfqd->cfq_slice[1] = cfq_slice_sync; cfqd->cfq_slice_async_rq = cfq_slice_async_rq; cfqd->cfq_slice_idle = cfq_slice_idle; + cfqd->cfq_target_latency = cfq_target_latency; + cfqd->cfq_hist_divisor = cfq_hist_divisor; + cfqd->cfq_async_penalty = cfq_async_penalty; cfqd->hw_tag = 1; return cfqd; @@ -2517,6 +2678,7 @@ fail: /* * sysfs parts below --> */ + static ssize_t cfq_var_show(unsigned int var, char *page) { @@ -2550,6 +2712,9 @@ SHOW_FUNCTION(cfq_slice_idle_show, cfqd- SHOW_FUNCTION(cfq_slice_sync_show, cfqd->cfq_slice[1], 1); 
SHOW_FUNCTION(cfq_slice_async_show, cfqd->cfq_slice[0], 1); SHOW_FUNCTION(cfq_slice_async_rq_show, cfqd->cfq_slice_async_rq, 0); +SHOW_FUNCTION(cfq_target_latency_show, cfqd->cfq_target_latency, 1); +SHOW_FUNCTION(cfq_hist_divisor_show, cfqd->cfq_hist_divisor, 0); +SHOW_FUNCTION(cfq_async_penalty_show, cfqd->cfq_async_penalty, 0); #undef SHOW_FUNCTION #define STORE_FUNCTION(__FUNC, __PTR, MIN, MAX, __CONV) \ @@ -2581,6 +2746,11 @@ STORE_FUNCTION(cfq_slice_sync_store, &cf STORE_FUNCTION(cfq_slice_async_store, &cfqd->cfq_slice[0], 1, UINT_MAX, 1); STORE_FUNCTION(cfq_slice_async_rq_store, &cfqd->cfq_slice_async_rq, 1, UINT_MAX, 0); + +STORE_FUNCTION(cfq_target_latency_store, &cfqd->cfq_target_latency, 1, 1000, 1); +STORE_FUNCTION(cfq_hist_divisor_store, &cfqd->cfq_hist_divisor, 1, 100, 0); +STORE_FUNCTION(cfq_async_penalty_store, &cfqd->cfq_async_penalty, 1, UINT_MAX, 0); + #undef STORE_FUNCTION #define CFQ_ATTR(name) \ @@ -2596,6 +2766,9 @@ static struct elv_fs_entry cfq_attrs[] = CFQ_ATTR(slice_async), CFQ_ATTR(slice_async_rq), CFQ_ATTR(slice_idle), + CFQ_ATTR(target_latency), + CFQ_ATTR(hist_divisor), + CFQ_ATTR(async_penalty), __ATTR_NULL }; ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <4e5e476b0909280835w3410d58aod93a29d1dcda8909-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-09-28 17:14 ` Vivek Goyal 2009-09-28 17:51 ` Mike Galbraith 1 sibling, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-28 17:14 UTC (permalink / raw) To: Corrado Zoccolo Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Mon, Sep 28, 2009 at 05:35:02PM +0200, Corrado Zoccolo wrote: > On Mon, Sep 28, 2009 at 4:56 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote: > >> Hi Vivek, > >> On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > >> > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote: > >> >> Vivek Goyal wrote: > >> >> > Notes: > >> >> > - With vanilla CFQ, random writers can overwhelm a random reader. > >> >> > Bring down its throughput and bump up latencies significantly. > >> >> > >> >> > >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers, > >> >> too. > >> >> > >> >> I'm basing this assumption on the observations I made on both OpenSuse > >> >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML > >> >> titled: "Poor desktop responsiveness with background I/O-operations" of > >> >> 2009-09-20. 
> >> >> (Message ID: 4AB59CBB.8090907-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org) > >> >> > >> >> > >> >> Thus, I'm posting this to show that your work is greatly appreciated, > >> >> given the rather disappointig status quo of Linux's fairness when it > >> >> comes to disk IO time. > >> >> > >> >> I hope that your efforts lead to a change in performance of current > >> >> userland applications, the sooner, the better. > >> >> > >> > [Please don't remove people from original CC list. I am putting them back.] > >> > > >> > Hi Ulrich, > >> > > >> > I quicky went through that mail thread and I tried following on my > >> > desktop. > >> > > >> > ########################################## > >> > dd if=/home/vgoyal/4G-file of=/dev/null & > >> > sleep 5 > >> > time firefox > >> > # close firefox once gui pops up. > >> > ########################################## > >> > > >> > It was taking close to 1 minute 30 seconds to launch firefox and dd got > >> > following. > >> > > >> > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s > >> > > >> > (Results do vary across runs, especially if system is booted fresh. Don't > >> > know why...). > >> > > >> > > >> > Then I tried putting both the applications in separate groups and assign > >> > them weights 200 each. > >> > > >> > ########################################## > >> > dd if=/home/vgoyal/4G-file of=/dev/null & > >> > echo $! > /cgroup/io/test1/tasks > >> > sleep 5 > >> > echo $$ > /cgroup/io/test2/tasks > >> > time firefox > >> > # close firefox once gui pops up. > >> > ########################################## > >> > > >> > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3. > >> > > >> > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s > >> > > >> > Notice that throughput of dd also improved. > >> > > >> > I ran the block trace and noticed in many a cases firefox threads > >> > immediately preempted the "dd". Probably because it was a file system > >> > request. 
So in this case latency will arise from seek time. > >> > > >> > In some other cases, threads had to wait for up to 100ms because dd was > >> > not preempted. In this case latency will arise both from waiting on queue > >> > as well as seek time. > >> > >> I think cfq should already be doing something similar, i.e. giving > >> 100ms slices to firefox, that alternate with dd, unless: > >> * firefox is too seeky (in this case, the idle window will be too small) > >> * firefox has too much think time. > >> > > > Hi Vivek, > > Hi Corrado, > > > > "firefox" is the shell script to setup the environment and launch the > > broser. It seems to be a group of threads. Some of them run in parallel > > and some of these seems to be running one after the other (once previous > > process or threads finished). > > Ok. > > > > >> To rule out the first case, what happens if you run the test with your > >> "fairness for seeky processes" patch? > > > > I applied that patch and it helps a lot. > > > > http://lwn.net/Articles/341032/ > > > > With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds. > > Great. > Can you try the attached patch (on top of 2.6.31)? > It implements the alternative approach we discussed privately in july, > and it addresses the possible latency increase that could happen with > your patch. > > To summarize for everyone, we separate sync sequential queues, sync > seeky queues and async queues in three separate RR strucutres, and > alternate servicing requests between them. > > When servicing seeky queues (the ones that are usually penalized by > cfq, for which no fairness is usually provided), we do not idle > between them, but we do idle for the last queue (the idle can be > exited when any seeky queue has requests). This allows us to allocate > disk time globally for all seeky processes, and to reduce seeky > processes latencies. > Ok, I seem to be doing same thing at group level (In group scheduling patches). 
I do not idle on individual sync seeky queues, but if this is the last
queue in the group, then I do idle to make sure the group does not lose its
fair share, and I exit from idle the moment there is any busy queue in the
group.

So you seem to be grouping all the sync seeky queues system wide in a
single group. So all the sync seeky queues collectively get 100ms in a
single round of dispatch? I am wondering what happens if there are a lot of
such sync seeky queues and this 100ms time slice is consumed before all the
sync seeky queues have had a chance to dispatch. Does that mean that some
of the queues can completely skip a dispatch round?

Thanks
Vivek

> I tested with 'konsole -e exit', while doing a sequential write with
> dd, and the start up time reduced from 37s to 7s, on an old laptop
> disk.
>
> Thanks,
> Corrado
>
> >
> >> To rule out the first case, what happens if you run the test with your
> >> "fairness for seeky processes" patch?
> >
> > I applied that patch and it helps a lot.
> >
> > http://lwn.net/Articles/341032/
> >
> > With above patchset applied, and fairness=1, firefox pops up in 27-28
> > seconds.
> >
> > So it looks like if we don't disable idle window for seeky processes on
> > hardware supporting command queuing, it helps in this particular case.
> >
> > Thanks
> > Vivek
> >

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <4e5e476b0909280835w3410d58aod93a29d1dcda8909-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-09-28 17:14 ` Vivek Goyal @ 2009-09-28 17:51 ` Mike Galbraith 1 sibling, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-09-28 17:51 UTC (permalink / raw) To: Corrado Zoccolo Cc: Tobias Oetiker, dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Mon, 2009-09-28 at 17:35 +0200, Corrado Zoccolo wrote: > Great. > Can you try the attached patch (on top of 2.6.31)? > It implements the alternative approach we discussed privately in july, > and it addresses the possible latency increase that could happen with > your patch. > > To summarize for everyone, we separate sync sequential queues, sync > seeky queues and async queues in three separate RR strucutres, and > alternate servicing requests between them. > > When servicing seeky queues (the ones that are usually penalized by > cfq, for which no fairness is usually provided), we do not idle > between them, but we do idle for the last queue (the idle can be > exited when any seeky queue has requests). This allows us to allocate > disk time globally for all seeky processes, and to reduce seeky > processes latencies. > > I tested with 'konsole -e exit', while doing a sequential write with > dd, and the start up time reduced from 37s to 7s, on an old laptop > disk. 
I was fiddling around trying to get IDLE class to behave at least, and
getting a bit frustrated. Class/priority didn't seem to make much if any
difference for konsole -e exit timings, and now I know why. I saw the
reference to Vivek's patch, and gave it a shot. Makes a large difference.

                                                           Avg
perf stat     12.82     7.19     8.49     5.76     9.32    8.7   anticipatory
              16.24   175.82   154.38   228.97   147.16  144.5   noop
              43.23    57.39    96.13   148.25   180.09  105.0   deadline
               9.15    14.51     9.39    15.06     9.90   11.6   cfq fairness=0 dd=nice 0
              12.22     9.85    12.55     9.88    15.06   11.9   cfq fairness=0 dd=nice 19
               9.77    13.19    11.78    17.40     9.51   11.9   cfq fairness=0 dd=SCHED_IDLE
               4.59     2.74     4.70     3.45     4.69    4.0   cfq fairness=1 dd=nice 0
               3.79     4.66     2.66     5.15     3.03    3.8   cfq fairness=1 dd=nice 19
               2.79     4.73     2.79     4.02     2.50    3.3   cfq fairness=1 dd=SCHED_IDLE

I'll give your patch a spin as well.

        -Mike

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <4e5e476b0909271000u69d79346s27cccad219e49902-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-09-28 14:56 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-28 14:56 UTC (permalink / raw) To: Corrado Zoccolo Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Sun, Sep 27, 2009 at 07:00:08PM +0200, Corrado Zoccolo wrote: > Hi Vivek, > On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote: > >> Vivek Goyal wrote: > >> > Notes: > >> > - With vanilla CFQ, random writers can overwhelm a random reader. > >> > Bring down its throughput and bump up latencies significantly. > >> > >> > >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers, > >> too. > >> > >> I'm basing this assumption on the observations I made on both OpenSuse > >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML > >> titled: "Poor desktop responsiveness with background I/O-operations" of > >> 2009-09-20. > >> (Message ID: 4AB59CBB.8090907-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org) > >> > >> > >> Thus, I'm posting this to show that your work is greatly appreciated, > >> given the rather disappointig status quo of Linux's fairness when it > >> comes to disk IO time. 
> >> > >> I hope that your efforts lead to a change in performance of current > >> userland applications, the sooner, the better. > >> > > [Please don't remove people from original CC list. I am putting them back.] > > > > Hi Ulrich, > > > > I quicky went through that mail thread and I tried following on my > > desktop. > > > > ########################################## > > dd if=/home/vgoyal/4G-file of=/dev/null & > > sleep 5 > > time firefox > > # close firefox once gui pops up. > > ########################################## > > > > It was taking close to 1 minute 30 seconds to launch firefox and dd got > > following. > > > > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s > > > > (Results do vary across runs, especially if system is booted fresh. Don't > > know why...). > > > > > > Then I tried putting both the applications in separate groups and assign > > them weights 200 each. > > > > ########################################## > > dd if=/home/vgoyal/4G-file of=/dev/null & > > echo $! > /cgroup/io/test1/tasks > > sleep 5 > > echo $$ > /cgroup/io/test2/tasks > > time firefox > > # close firefox once gui pops up. > > ########################################## > > > > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3. > > > > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s > > > > Notice that throughput of dd also improved. > > > > I ran the block trace and noticed in many a cases firefox threads > > immediately preempted the "dd". Probably because it was a file system > > request. So in this case latency will arise from seek time. > > > > In some other cases, threads had to wait for up to 100ms because dd was > > not preempted. In this case latency will arise both from waiting on queue > > as well as seek time. > > I think cfq should already be doing something similar, i.e. 
giving > 100ms slices to firefox, that alternate with dd, unless: > * firefox is too seeky (in this case, the idle window will be too small) > * firefox has too much think time. > Hi Corrado, "firefox" is the shell script to setup the environment and launch the broser. It seems to be a group of threads. Some of them run in parallel and some of these seems to be running one after the other (once previous process or threads finished). > To rule out the first case, what happens if you run the test with your > "fairness for seeky processes" patch? I applied that patch and it helps a lot. http://lwn.net/Articles/341032/ With above patchset applied, and fairness=1, firefox pops up in 27-28 seconds. So it looks like if we don't disable idle window for seeky processes on hardware supporting command queuing, it helps in this particular case. Thanks Vivek > To rule out the second case, what happens if you increase the slice_idle? > > Thanks, > Corrado > > > > > With cgroup thing, We will run 100ms slice for the group in which firefox > > is being launched and then give 100ms uninterrupted time slice to dd. So > > it should cut down on number of seeks happening and that's why we probably > > see this improvement. > > > > So grouping can help in such cases. May be you can move your X session in > > one group and launch the big IO in other group. Most likely you should > > have better desktop experience without compromising on dd thread output. > > > Thanks > > Vivek > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > > > > > > -- > __________________________________________________________________________ > > dott. 
Corrado Zoccolo mailto:czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org > PhD - Department of Computer Science - University of Pisa, Italy > -------------------------------------------------------------------------- ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <20090925202636.GC15007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-09-26 14:51 ` Mike Galbraith 2009-09-27 17:00 ` Corrado Zoccolo 1 sibling, 0 replies; 349+ messages in thread From: Mike Galbraith @ 2009-09-26 14:51 UTC (permalink / raw) To: Vivek Goyal Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Fri, 2009-09-25 at 16:26 -0400, Vivek Goyal wrote: > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote: > > Vivek Goyal wrote: > > > Notes: > > > - With vanilla CFQ, random writers can overwhelm a random reader. > > > Bring down its throughput and bump up latencies significantly. > > > > > > IIRC, with vanilla CFQ, sequential writing can overwhelm random readers, > > too. > > > > I'm basing this assumption on the observations I made on both OpenSuse > > 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML > > titled: "Poor desktop responsiveness with background I/O-operations" of > > 2009-09-20. > > (Message ID: 4AB59CBB.8090907-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org) > > > > > > Thus, I'm posting this to show that your work is greatly appreciated, > > given the rather disappointig status quo of Linux's fairness when it > > comes to disk IO time. > > > > I hope that your efforts lead to a change in performance of current > > userland applications, the sooner, the better. 
> > > [Please don't remove people from original CC list. I am putting them back.] > > Hi Ulrich, > > I quicky went through that mail thread and I tried following on my > desktop. > > ########################################## > dd if=/home/vgoyal/4G-file of=/dev/null & > sleep 5 > time firefox > # close firefox once gui pops up. > ########################################## > > It was taking close to 1 minute 30 seconds to launch firefox and dd got > following. > > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s > > (Results do vary across runs, especially if system is booted fresh. Don't > know why...). > > > Then I tried putting both the applications in separate groups and assign > them weights 200 each. > > ########################################## > dd if=/home/vgoyal/4G-file of=/dev/null & > echo $! > /cgroup/io/test1/tasks > sleep 5 > echo $$ > /cgroup/io/test2/tasks > time firefox > # close firefox once gui pops up. > ########################################## > > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3. > > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s > > Notice that throughput of dd also improved. > > I ran the block trace and noticed in many a cases firefox threads > immediately preempted the "dd". Probably because it was a file system > request. So in this case latency will arise from seek time. > > In some other cases, threads had to wait for up to 100ms because dd was > not preempted. In this case latency will arise both from waiting on queue > as well as seek time. Hm, with tip, I see ~10ms max wakeup latency running scriptlet below. > With cgroup thing, We will run 100ms slice for the group in which firefox > is being launched and then give 100ms uninterrupted time slice to dd. So > it should cut down on number of seeks happening and that's why we probably > see this improvement. I'm not testing with group IO/CPU, but my numbers kinda agree that it's seek latency that's THE killer. 
What the compiled numbers from the cheezy script below _seem_ to be telling
me is that the default setting of CFQ quantum is allowing too many write
requests through, inflicting too much read latency... for the disk where my
binaries live. The longer the seeky burst, the more it hurts both
reader/writer, so cutting down the max requests queueable helps the reader
(which I think can't queue anything near per unit time that the writer can)
finish and get out of the writer's way sooner.

'nuff possibly useless words, onward to possibly useless numbers :)

dd pre == number dd emits upon receiving USR1 before execing perf.
perf stat == time to load/execute perf stat konsole -e exit.
dd post == same dd number, after perf finishes.

quantum = 1                                       Avg
dd pre        58.4    52.5    56.1    61.6    52.3   56.1  MB/s
perf stat     2.87    0.91    1.64    1.41    0.90    1.5  Sec
dd post       56.6    61.0    66.3    64.7    60.9   61.9

quantum = 2
dd pre        59.7    62.4    58.9    65.3    60.3   61.3
perf stat     5.81    6.09    6.24   10.13    6.21    6.8
dd post       64.0    62.6    64.2    60.4    61.1   62.4

quantum = 3
dd pre        65.5    57.7    54.5    51.1    56.3   57.0
perf stat    14.01   13.71    8.35    5.35    8.57    9.9
dd post       59.2    49.1    58.8    62.3    62.1   58.3

quantum = 4
dd pre        57.2    52.1    56.8    55.2    61.6   56.5
perf stat    11.98    1.61    9.63   16.21   11.13   10.1
dd post       57.2    52.6    62.2    49.3    50.2   54.3

Nothing pinned btw, 4 cores available, but only 1 drive.
#!/bin/sh

DISK=sdb
QUANTUM=/sys/block/$DISK/queue/iosched/quantum
END=$(cat $QUANTUM)

# Sweep quantum from 1 up to the value that was set when the script started.
for q in `seq 1 $END`; do
    echo $q > $QUANTUM
    LOGFILE=quantum_log_$q
    rm -f $LOGFILE
    for i in `seq 1 5`; do
        echo 2 > /proc/sys/vm/drop_caches
        # Background sequential writer.
        sh -c "dd if=/dev/zero of=./deleteme.dd 2>&1|tee -a $LOGFILE" &
        sleep 30
        sh -c "echo quantum $(cat $QUANTUM) loop $i" 2>&1|tee -a $LOGFILE
        # Pull perf itself into RAM; the process name is deliberately bogus.
        perf stat -- killall -q get_stuf_into_ram >/dev/null 2>&1
        sleep 1
        # USR1 makes dd print its throughput so far ("dd pre").
        killall -q -USR1 dd &
        sleep 1
        sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE
        sleep 1
        # Second throughput sample ("dd post").
        killall -q -USR1 dd &
        sleep 5
        killall -qw dd
        rm -f ./deleteme.dd
        sync
        sh -c "echo" 2>&1|tee -a $LOGFILE
    done;
done;

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <20090925202636.GC15007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-09-26 14:51 ` Mike Galbraith @ 2009-09-27 17:00 ` Corrado Zoccolo 1 sibling, 0 replies; 349+ messages in thread From: Corrado Zoccolo @ 2009-09-27 17:00 UTC (permalink / raw) To: Vivek Goyal Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, Ulrich Lukas, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Hi Vivek, On Fri, Sep 25, 2009 at 10:26 PM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote: >> Vivek Goyal wrote: >> > Notes: >> > - With vanilla CFQ, random writers can overwhelm a random reader. >> > Bring down its throughput and bump up latencies significantly. >> >> >> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers, >> too. >> >> I'm basing this assumption on the observations I made on both OpenSuse >> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML >> titled: "Poor desktop responsiveness with background I/O-operations" of >> 2009-09-20. >> (Message ID: 4AB59CBB.8090907@datenparkplatz.de) >> >> >> Thus, I'm posting this to show that your work is greatly appreciated, >> given the rather disappointig status quo of Linux's fairness when it >> comes to disk IO time. >> >> I hope that your efforts lead to a change in performance of current >> userland applications, the sooner, the better. >> > [Please don't remove people from original CC list. 
I am putting them back.] > > Hi Ulrich, > > I quicky went through that mail thread and I tried following on my > desktop. > > ########################################## > dd if=/home/vgoyal/4G-file of=/dev/null & > sleep 5 > time firefox > # close firefox once gui pops up. > ########################################## > > It was taking close to 1 minute 30 seconds to launch firefox and dd got > following. > > 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s > > (Results do vary across runs, especially if system is booted fresh. Don't > know why...). > > > Then I tried putting both the applications in separate groups and assign > them weights 200 each. > > ########################################## > dd if=/home/vgoyal/4G-file of=/dev/null & > echo $! > /cgroup/io/test1/tasks > sleep 5 > echo $$ > /cgroup/io/test2/tasks > time firefox > # close firefox once gui pops up. > ########################################## > > Now I firefox pops up in 27 seconds. So it cut down the time by 2/3. > > 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s > > Notice that throughput of dd also improved. > > I ran the block trace and noticed in many a cases firefox threads > immediately preempted the "dd". Probably because it was a file system > request. So in this case latency will arise from seek time. > > In some other cases, threads had to wait for up to 100ms because dd was > not preempted. In this case latency will arise both from waiting on queue > as well as seek time. I think cfq should already be doing something similar, i.e. giving 100ms slices to firefox, that alternate with dd, unless: * firefox is too seeky (in this case, the idle window will be too small) * firefox has too much think time. To rule out the first case, what happens if you run the test with your "fairness for seeky processes" patch? To rule out the second case, what happens if you increase the slice_idle? 
Thanks, Corrado > > With cgroup thing, We will run 100ms slice for the group in which firefox > is being launched and then give 100ms uninterrupted time slice to dd. So > it should cut down on number of seeks happening and that's why we probably > see this improvement. > > So grouping can help in such cases. May be you can move your X session in > one group and launch the big IO in other group. Most likely you should > have better desktop experience without compromising on dd thread output. > Thanks > Vivek > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > -- __________________________________________________________________________ dott. Corrado Zoccolo mailto:czoccolo@gmail.com PhD - Department of Computer Science - University of Pisa, Italy -------------------------------------------------------------------------- _______________________________________________ Containers mailing list Containers@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-09-24 19:25 Vivek Goyal @ 2009-09-29 0:37 ` Nauman Rafique 2009-09-25 2:20 ` Ulrich Lukas ` (2 subsequent siblings) 3 siblings, 0 replies; 349+ messages in thread From: Nauman Rafique @ 2009-09-29 0:37 UTC (permalink / raw) To: Vivek Goyal Cc: linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya Hi Vivek, Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with Jens about IO controller during Linux Plumbers Conference '09. Jens expressed his concerns about the size and complexity of the patches. I believe that is a reasonable concern. We talked about things that could be done to reduce the size of the patches. The requirement that the "solution has to work with all IO schedulers" seems like a secondary concern at this point; and it came out as one thing that can help to reduce the size of the patch set. Another possibility is to use a simpler scheduling algorithm e.g. weighted round robin, instead of BFQ scheduler. BFQ indeed has great properties, but we cannot deny the fact that it is complex to understand, and might be cumbersome to maintain. Also, hierarchical scheduling is something that could be unnecessary in the first set of patches, even though cgroups are hierarchical in nature. We are starting from a point where there is no cgroup based IO scheduling in the kernel. And it is probably not reasonable to satisfy all IO scheduling related requirements in one patch set. We can start with something simple, and build on top of that. So a very simple patch set that enables cgroup based proportional scheduling for CFQ seems like the way to go at this point. It would be great if we discuss our plans on the mailing list, so we can get early feedback from everyone. 
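To make the weighted round robin suggestion above a little more concrete, here is a minimal sketch in C. It is not taken from any posted patch: the struct wrr_group layout, the wrr_pick() name and the credit-per-round scheme are all assumptions made for illustration, and a real CFQ change would hand out time slices rather than single picks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cgroup state; none of these names come from the patches. */
struct wrr_group {
    unsigned int weight;   /* configured share, expected to be > 0 */
    unsigned int credit;   /* picks left in the current round */
    int backlogged;        /* non-zero if the group has pending IO */
};

/*
 * Pick the next group to dispatch from.
 * Each round, a group may be picked up to `weight` times; when no
 * backlogged group has credit left, every credit is refilled and a new
 * round starts. Returns the group index, or -1 if nothing is backlogged.
 */
int wrr_pick(struct wrr_group *g, size_t n, size_t *cursor)
{
    size_t scanned;
    int refilled = 0;

    assert(g != NULL && cursor != NULL && n > 0);

    for (;;) {
        for (scanned = 0; scanned < n; scanned++) {
            size_t i = (*cursor + scanned) % n;

            if (g[i].backlogged && g[i].credit > 0) {
                g[i].credit--;
                /* Stay on this group until its credit is used up. */
                *cursor = g[i].credit ? i : (i + 1) % n;
                return (int)i;
            }
        }
        if (refilled)
            return -1;      /* refilled once already, still nothing to serve */
        for (scanned = 0; scanned < n; scanned++)
            g[scanned].credit = g[scanned].weight;
        refilled = 1;
    }
}
```

With two always-backlogged groups of weight 2 and 1, this sketch serves them in a 2:1 pattern per round. It has none of BFQ's service guarantees, which is the trade-off being discussed.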
^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-09-29 0:37 ` Nauman Rafique @ 2009-09-29 3:22 ` Vivek Goyal -1 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-29 3:22 UTC (permalink / raw) To: Nauman Rafique Cc: linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, ryov, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: > Hi Vivek, > Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with > Jens about IO controller during Linux Plumbers Conference '09. Jens > expressed his concerns about the size and complexity of the patches. I > believe that is a reasonable concern. We talked about things that > could be done to reduce the size of the patches. The requirement that > the "solution has to work with all IO schedulers" seems like a > secondary concern at this point; and it came out as one thing that can > help to reduce the size of the patch set. Initially doing cgroup based IO control only for CFQ should help a lot in reducing the patchset size. > Another possibility is to > use a simpler scheduling algorithm e.g. weighted round robin, instead > of BFQ scheduler. BFQ indeed has great properties, but we cannot deny > the fact that it is complex to understand, and might be cumbersome to > maintain. Core of the BFQ I have gotten rid of already. The remaining part is idle tree and data structures. I will see how can I simplify it further. > Also, hierarchical scheduling is something that could be > unnecessary in the first set of patches, even though cgroups are > hierarchical in nature. Sure. Though I don't think that a lot of code is there because of hierarchical nature. If we solve the issue at CFQ layer, we have to maintain atleast two levels. One for queue and other for groups. 
So even the simplest solution becomes almost hierarchical in nature. But I
will still see how to get rid of some code here too...

>
> We are starting from a point where there is no cgroup based IO
> scheduling in the kernel. And it is probably not reasonable to satisfy
> all IO scheduling related requirements in one patch set. We can start
> with something simple, and build on top of that. So a very simple
> patch set that enables cgroup based proportional scheduling for CFQ
> seems like the way to go at this point.

Sure, we can start with CFQ only. But a bigger question we need to answer
is: is CFQ the right place to solve the issue? Jens, do you think that CFQ
is the right place to solve the problem?

Andrew seems to favor a high level approach so that IO schedulers are less
complex and we can provide fairness at high level logical devices also.

I will again try to summarize my understanding so far about the pros/cons
of each approach and then we can take the discussion forward.

Fairness in terms of size of IO or disk time used
=================================================
On seeky media, fairness in terms of disk time can get us better results
than fairness in terms of size of IO or number of IOs.

If we implement some kind of time based solution at a higher layer, then
that higher layer should know how much disk time each group used. We can
probably do some kind of timestamping in the bio to get a sense of when it
got to the disk and when it finished. But on multi queue hardware there can
be multiple requests in the disk, either from the same queue or from
different queues, and with a pure timestamping based approach I could not,
so far, think of how at a high level we would get an idea of who used how
much time.

So this is the first point of contention: how do we want to provide
fairness? In terms of disk time used, or in terms of size of IO/number of
IOs?
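The difference between the two fairness notions can be put into rough numbers. The sketch below is only an illustration under made-up assumptions: one group reads sequentially at a cost of about 20 us of disk time per KB, the other does seeky reads at about 2000 us per KB. Both costs, and the helper names, are hypothetical and not measured on any hardware in this thread.

```c
#include <assert.h>

/*
 * KB/s that EACH of two groups gets when the controller equalizes bytes.
 * One second of disk time is split so that both groups move the same
 * number of KB: x * cost_a + x * cost_b = 1,000,000 us.
 */
unsigned long equal_bytes_rate_kbps(unsigned long cost_a_us_per_kb,
                                    unsigned long cost_b_us_per_kb)
{
    assert(cost_a_us_per_kb + cost_b_us_per_kb > 0);
    return 1000000UL / (cost_a_us_per_kb + cost_b_us_per_kb);
}

/*
 * KB/s one group gets when it owns share_permille/1000 of the disk time.
 * share_permille = 500 means half of every second.
 */
unsigned long timed_rate_kbps(unsigned long cost_us_per_kb,
                              unsigned long share_permille)
{
    assert(cost_us_per_kb > 0 && share_permille <= 1000);
    return (1000UL * share_permille) / cost_us_per_kb;
}
```

Under these assumed costs, equalizing bytes gives both groups about 495 KB/s, because the seeky group ends up holding roughly 99% of the disk time. Equalizing time gives the sequential group 25000 KB/s and the seeky group 250 KB/s.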
Max bandwidth Controller or Proportional bandwidth controller
=============================================================
What is our primary requirement here? A weight based proportional
bandwidth controller, where we can use the resources optimally and any
kind of throttling kicks in only if there is contention for the disk?

Or do we want max bandwidth control, where a group is not allowed to use
the disk even if the disk is free?

Or do we need both? I would think that at some point we will need both,
but we can start with proportional bandwidth control first.

Fairness for higher level logical devices
=========================================
Do we want good fairness numbers for higher level logical devices also, or
is it sufficient to provide fairness at leaf nodes? Providing fairness at
leaf nodes can help us use the resources optimally, and in the process we
can get fairness at the higher level also in many of the cases.

But do we want strict fairness numbers on higher level logical devices
even if it means sub-optimal usage of the underlying physical devices?

I think that for proportional bandwidth control it should be ok to provide
fairness at leaf nodes, but for max bandwidth control it might make more
sense to provide fairness at the higher level. Consider a case where, on a
striped device, a customer wants to limit a group to 30MB/s. With leaf node
control, if every leaf node allows 30MB/s, it might add up to much more
than the specified rate at the logical device.

Latency Control and strong isolation between groups
===================================================
Do we want better latencies and stronger isolation between groups?

I think if the problem is solved at the IO scheduler level, we can achieve
better latency control and hence stronger isolation between groups.

Higher level solutions should find it hard to provide the same kind of
latency control and isolation between groups as an IO scheduler based
solution.
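The striped device case mentioned under "Fairness for higher level logical devices" is simple arithmetic, sketched below in C. The stripe count of 4, the per-leaf demand figures and both function names are assumptions for illustration only; only the 30MB/s limit comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Worst-case rate at the logical device if every leaf enforces the full limit. */
unsigned int aggregate_rate_mbps(unsigned int leaf_limit_mbps,
                                 unsigned int nr_leaves)
{
    return leaf_limit_mbps * nr_leaves;
}

/*
 * Rate the group actually reaches when each leaf caps it at leaf_limit_mbps
 * and the group's demand on leaf i is demand_mbps[i].
 */
unsigned int achieved_rate_mbps(const unsigned int *demand_mbps,
                                unsigned int nr_leaves,
                                unsigned int leaf_limit_mbps)
{
    unsigned int i, total = 0;

    assert(demand_mbps != NULL);
    for (i = 0; i < nr_leaves; i++)
        total += demand_mbps[i] < leaf_limit_mbps ? demand_mbps[i]
                                                  : leaf_limit_mbps;
    return total;
}
```

With 4 leaves each enforcing 30MB/s, the group can reach up to 120MB/s at the logical device. Splitting the limit evenly instead (30 / 4 = 7MB/s per leaf with integer division) holds the total under 30MB/s, but a group whose IO lands on a single leaf is then capped at 7MB/s. That is the sub-optimal usage the section worries about.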
Fairness for buffered writes
============================
Doing io control at any place below the page cache has the disadvantage
that the page cache might not dispatch more writes from a higher weight
group, hence the higher weight group might not see more IO done. Andrew
says that we don't have a solution to this problem in the kernel and he
would like to see it handled properly.

The only way to solve this seems to be to slow down the writers before
they write into the page cache. The IO throttling patch handled it by
slowing down a writer if it crossed the max specified rate. Other
suggestions have come in the form of a dirty_ratio per memory cgroup, or a
separate cgroup controller altogether where some kind of per group write
limit can be specified.

So whether the solution is implemented at the IO scheduler layer or at the
device mapper layer, both will have to rely on another controller being
co-mounted to handle buffered writes properly.

Fairness within group
=====================
One of the issues with a higher level controller is how to do fair
throttling so that fairness within a group is not impacted, especially
making sure that we don't break the notion of ioprio of the processes
within the group.

The io throttling patch in particular was very bad in terms of prio within
a group: throttling treated everyone equally and the difference between
process prios disappeared.

Reads Vs Writes
===============
A higher level control most likely will change the ratio in which reads
and writes are dispatched to disk within a group. So far that has been
decided by the IO scheduler, but with higher level groups doing throttling,
possibly buffering the bios and releasing them later, they will have to
come up with their own policy on the proportion in which reads and writes
should be dispatched. In the case of IO scheduler based control, all the
queuing takes place at the IO scheduler and it still retains control of
the ratio in which reads and writes are dispatched.
Summary
=======

- An io scheduler based io controller can provide better latencies,
  stronger isolation between groups and time based fairness, and will not
  interfere with io scheduler policies like class, ioprio and reader vs
  writer issues.

  But it cannot guarantee fairness at higher level logical devices.
  Especially in the case of max bw control, leaf node control does not
  sound like the most appropriate thing.

- IO throttling provides max bw control in terms of absolute rate. It has
  the advantage that it can provide control at a higher level logical
  device and also control buffered writes without the need for an
  additional controller to be co-mounted.

  But it does only max bw control and not proportional control, so one
  might not be using resources optimally. It loses the sense of task prio
  and class within a group, as any of the tasks in the group can be
  throttled. Because throttling does not kick in till you hit the max bw
  limit, it should find it hard to provide the same latencies as io
  scheduler based control.

- dm-ioband also has the advantage that it can provide fairness at higher
  level logical devices.

  But fairness is provided only in terms of size of IO or number of IOs.
  No time based fairness. It is very throughput oriented and does not
  throttle a high speed group if another group is running a slow random
  reader. This results in bad latencies for the random reader group and
  weaker isolation between groups.

  Also, it does not provide fairness if a group is not continuously
  backlogged. So if one is running 1-2 dd/sequential readers in the group,
  one does not get fairness until the workload is increased to a point
  where the group becomes continuously backlogged. This also results in
  poor latencies and limited fairness.

At this point it does not look like a single IO controller can cover all
the scenarios/requirements. This means a few things to me.

- Drop some of the requirements and go with one implementation which meets
  that reduced set of requirements.

- Have more than one IO controller implementation in the kernel.
One for lower level control for better latencies, stronger isolation and optimal resource usage and other one for fairness at higher level logical devices and max bandwidth control. And let user decide which one to use based on his/her needs. - Come up with more intelligent way of doing IO control where single controller covers all the cases. At this point of time, I am more inclined towards option 2 of having more than one implementation in kernel. :-) (Until and unless we can brainstrom and come up with ideas to make option 3 happen). > > It would be great if we discuss our plans on the mailing list, so we > can get early feedback from everyone. This is what comes to my mind so far. Please add to the list if I have missed some points. Also correct me if I am wrong about the pros/cons of the approaches. Thoughts/ideas/opinions are welcome... Thanks Vivek ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-09-29 3:22 UTC
To: Nauman Rafique
Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
    paolo.valente, jmarchan, guijianfeng, fernando, mikew,
    yoshikawa.takuya, jmoyer, mingo, m-ikeda, riel, lizf, fchecconi,
    s-uchida, containers, linux-kernel, akpm, righi.andrea, torvalds

On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> Hi Vivek,
> Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with
> Jens about IO controller during Linux Plumbers Conference '09. Jens
> expressed his concerns about the size and complexity of the patches. I
> believe that is a reasonable concern. We talked about things that
> could be done to reduce the size of the patches. The requirement that
> the "solution has to work with all IO schedulers" seems like a
> secondary concern at this point; and it came out as one thing that can
> help to reduce the size of the patch set.

Initially doing cgroup based IO control only for CFQ should help a lot in
reducing the patchset size.

> Another possibility is to
> use a simpler scheduling algorithm e.g. weighted round robin, instead
> of BFQ scheduler. BFQ indeed has great properties, but we cannot deny
> the fact that it is complex to understand, and might be cumbersome to
> maintain.

I have already gotten rid of the core of BFQ. The remaining part is the
idle tree and data structures. I will see how I can simplify it further.

> Also, hierarchical scheduling is something that could be
> unnecessary in the first set of patches, even though cgroups are
> hierarchical in nature.

Sure. Though I don't think that a lot of code is there because of the
hierarchical nature. If we solve the issue at CFQ layer, we have to maintain
at least two levels. One for queues and the other for groups. So even the
simplest solution becomes almost hierarchical in nature.
But I will still see how to get rid of some code here too...

>
> We are starting from a point where there is no cgroup based IO
> scheduling in the kernel. And it is probably not reasonable to satisfy
> all IO scheduling related requirements in one patch set. We can start
> with something simple, and build on top of that. So a very simple
> patch set that enables cgroup based proportional scheduling for CFQ
> seems like the way to go at this point.

Sure, we can start with CFQ only. But a bigger question we need to answer
is whether CFQ is the right place to solve the issue. Jens, do you think
that CFQ is the right place to solve the problem?

Andrew seems to favor a high level approach so that IO schedulers are less
complex and we can provide fairness at high level logical devices also.

I will again try to summarize my understanding so far about the pros/cons
of each approach and then we can take the discussion forward.

Fairness in terms of size of IO or disk time used
=================================================
On seeky media, fairness in terms of disk time can get us better results
than fairness in terms of size of IO or number of IO.

If we implement some kind of time based solution at higher layer, then
that higher layer should know how much time each group used. We
can probably do some kind of timestamping in bio to get a sense of when it
got into the disk and when it finished. But on multi queue hardware there
can be multiple requests in the disk either from same queue or from different
queues and with a pure timestamping based approach, so far I could not think
how at high level we will get an idea who used how much time.

So this is the first point of contention: how do we want to provide
fairness? In terms of disk time used or in terms of size of IO/number of
IO.

Max bandwidth Controller or Proportional bandwidth controller
=============================================================
What is our primary requirement here?
A weight based proportional bandwidth controller where we can use the
resources optimally and any kind of throttling kicks in only if there is
contention for the disk.

Or we want max bandwidth control where a group is not allowed to exceed
its limit even if the disk is free.

Or we need both? I would think that at some point of time we will need
both but we can start with proportional bandwidth control first.

Fairness for higher level logical devices
=========================================
Do we want good fairness numbers for higher level logical devices also
or is it sufficient to provide fairness at leaf nodes? Providing fairness
at leaf nodes can help us use the resources optimally and in the process
we can get fairness at higher level also in many of the cases.

But do we want strict fairness numbers on higher level logical devices
even if it means sub-optimal usage of underlying physical devices?

I think that for proportional bandwidth control, it should be ok to provide
fairness at higher level logical device but for max bandwidth control it
might make more sense to provide fairness at higher level. Consider a
case where from a striped device a customer wants to limit a group to
30MB/s and in case of leaf node control, if every leaf node provides
30MB/s, it might accumulate to much more than specified rate at logical
device.

Latency Control and strong isolation between groups
===================================================
Do we want good isolation between groups and better latencies?

I think if problem is solved at IO scheduler level, we can achieve better
latency control and hence stronger isolation between groups.

Higher level solutions should find it hard to provide same kind of latency
control and isolation between groups as IO scheduler based solution.
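
To put numbers on the striped-device case above, here is a minimal sketch. It assumes a hypothetical stripe of four leaf disks, each enforcing a cap on its own; the function names are made up for illustration and do not correspond to anything in the patches.

```python
def aggregate_cap_mb_s(per_leaf_cap_mb_s, leaf_count):
    """Upper bound on the rate seen at the logical device when every
    leaf node enforces the same cap independently."""
    return per_leaf_cap_mb_s * leaf_count


def even_split_cap_mb_s(group_cap_mb_s, leaf_count):
    """Per-leaf cap if the group limit is simply divided evenly
    (assumption: one naive way to keep the sum within the limit)."""
    return group_cap_mb_s / leaf_count


if __name__ == "__main__":
    leaves = 4  # hypothetical stripe width
    print(aggregate_cap_mb_s(30, leaves))   # total the group may reach, MB/s
    print(even_split_cap_mb_s(30, leaves))  # per-leaf cap after even split, MB/s
```

An even split keeps the total within the limit, but a group whose IO lands mostly on one leaf would then be held well below 30MB/s, which is one reason enforcing max bandwidth at the logical device looks more natural.
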

Fairness for buffered writes
============================
Doing io control at any place below page cache has the disadvantage that page
cache might not dispatch more writes from higher weight group, hence higher
weight group might not see more IO done. Andrew says that we don't have
a solution to this problem in kernel and he would like to see it handled
properly.

Only way to solve this seems to be to slow down the writers before they
write into page cache. IO throttling patch handled it by slowing down
writer if it crossed max specified rate. Other suggestions have come in
the form of dirty_ratio per memory cgroup or a separate cgroup controller
altogether where some kind of per group write limit can be specified.

So if solution is implemented at IO scheduler layer or at device mapper
layer, both shall have to rely on another controller to be co-mounted
to handle buffered writes properly.

Fairness within group
=====================
One of the issues with a higher level controller is how to do fair
throttling so that fairness within group is not impacted, especially
making sure that we don't break the notion of ioprio of the
processes within group.

The io throttling patch in particular was very bad in terms of prio within
group, where throttling treated everyone equally and the difference between
process prio disappeared.

Reads Vs Writes
===============
A higher level control most likely will change the ratio in which reads
and writes are dispatched to disk within group. It used to be decided
by IO scheduler so far but with higher level groups doing throttling and
possibly buffering the bios and releasing them later, they will have to
come up with their own policy on in what proportion reads and writes
should be dispatched. In case of IO scheduler based control, all the
queuing takes place at IO scheduler and it still retains control of
in what ratio reads and writes should be dispatched.
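
For the buffered-write case discussed above, slowing down a writer before it dirties page cache can be sketched roughly as follows. This only illustrates rate-based throttling in general; it is not the algorithm used by the IO throttling patch, and the function name and parameters are hypothetical.

```python
def throttle_delay_s(bytes_written, elapsed_s, max_rate_bytes_s):
    """Seconds a writer should sleep so that its average rate since the
    start of the accounting window does not exceed max_rate_bytes_s."""
    if max_rate_bytes_s <= 0:
        raise ValueError("max_rate_bytes_s must be positive")
    # Time the writes would have taken at exactly the allowed rate.
    earliest_finish_s = bytes_written / max_rate_bytes_s
    # A writer that is already at or below the limit is not delayed.
    return max(0.0, earliest_finish_s - elapsed_s)
```

A writer that pushed 60MB in one second against a 30MB/s limit would be put to sleep for about one more second, while a writer below the limit sees no delay at all. That also shows why such a scheme does nothing for proportional fairness until the limit is hit.
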

Summary
=======

- An io scheduler based io controller can provide better latencies,
  stronger isolation between groups, time based fairness and will not
  interfere with io scheduler policies like class, ioprio and
  reader vs writer issues.

  But it cannot guarantee fairness at higher level logical devices.
  Especially in case of max bw control, leaf node control does not sound
  to be the most appropriate thing.

- IO throttling provides max bw control in terms of absolute rate. It has
  the advantage that it can provide control at higher level logical device
  and also control buffered writes without need of additional controller
  co-mounted.

  But it does only max bw control and not proportional control so one might
  not be using resources optimally. It loses sense of task prio and class
  within group as any of the tasks can be throttled within group. Because
  throttling does not kick in till you hit the max bw limit, it should find
  it hard to provide same latencies as io scheduler based control.

- dm-ioband also has the advantage that it can provide fairness at higher
  level logical devices.

  But, fairness is provided only in terms of size of IO or number of IO.
  No time based fairness. It is very throughput oriented and does not
  throttle high speed group if other group is running slow random reader.
  This results in bad latencies for random reader group and weaker
  isolation between groups.

  Also it does not provide fairness if a group is not continuously
  backlogged. So if one is running 1-2 dd/sequential readers in the group,
  one does not get fairness until workload is increased to a point where
  group becomes continuously backlogged. This also results in poor
  latencies and limited fairness.

At this point of time it does not look like a single IO controller covers
all the scenarios/requirements. This means a few things to me.

- Drop some of the requirements and go with one implementation which meets
  that reduced set of requirements.

- Have more than one IO controller implementation in kernel. One for lower
  level control for better latencies, stronger isolation and optimal resource
  usage and other one for fairness at higher level logical devices and max
  bandwidth control.

  And let user decide which one to use based on his/her needs.

- Come up with more intelligent way of doing IO control where single
  controller covers all the cases.

At this point of time, I am more inclined towards option 2 of having more
than one implementation in kernel. :-) (Until and unless we can brainstorm
and come up with ideas to make option 3 happen).

>
> It would be great if we discuss our plans on the mailing list, so we
> can get early feedback from everyone.

This is what comes to my mind so far. Please add to the list if I have missed
some points. Also correct me if I am wrong about the pros/cons of the
approaches.

Thoughts/ideas/opinions are welcome...

Thanks
Vivek
* Re: IO scheduler based IO controller V10
From: Ryo Tsuruta @ 2009-09-29 9:56 UTC
To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
    dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA,
    agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
    paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA,
    fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA,
    mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA,
    fchecconi-Re5JQEeQqe8AvxtiuMwx3w,
    containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
    linux-kernel-u79uwXL29TY76Z2rM5mHXA,
    akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
    righi.andrea-Re5JQEeQqe8AvxtiuMwx3w,
    torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi Vivek and all,

Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
> > We are starting from a point where there is no cgroup based IO
> > scheduling in the kernel. And it is probably not reasonable to satisfy
> > all IO scheduling related requirements in one patch set. We can start
> > with something simple, and build on top of that. So a very simple
> > patch set that enables cgroup based proportional scheduling for CFQ
> > seems like the way to go at this point.
>
> Sure, we can start with CFQ only. But a bigger question we need to answer
> is whether CFQ is the right place to solve the issue. Jens, do you think
> that CFQ is the right place to solve the problem?
>
> Andrew seems to favor a high level approach so that IO schedulers are less
> complex and we can provide fairness at high level logical devices also.

I'm not in favor of expanding CFQ, because some enterprise storage
performs better with NOOP than with CFQ, and I think bandwidth
control is needed much more for such storage systems. Is it easy to
support other IO schedulers even if a new IO scheduler is introduced?
I would like to know a bit more about the specifics of Nauman's
scheduler design.

> I will again try to summarize my understanding so far about the pros/cons
> of each approach and then we can take the discussion forward.

Good summary. Thanks for your work.

> Fairness in terms of size of IO or disk time used
> =================================================
> On seeky media, fairness in terms of disk time can get us better results
> than fairness in terms of size of IO or number of IO.
>
> If we implement some kind of time based solution at higher layer, then
> that higher layer should know how much time each group used. We
> can probably do some kind of timestamping in bio to get a sense of when it
> got into the disk and when it finished. But on multi queue hardware there
> can be multiple requests in the disk either from same queue or from different
> queues and with a pure timestamping based approach, so far I could not think
> how at high level we will get an idea who used how much time.

IIUC, could the overlap time be calculated from timestamps on
multi queue hardware?

> So this is the first point of contention: how do we want to provide
> fairness? In terms of disk time used or in terms of size of IO/number of
> IO.
>
> Max bandwidth Controller or Proportional bandwidth controller
> =============================================================
> What is our primary requirement here? A weight based proportional
> bandwidth controller where we can use the resources optimally and any
> kind of throttling kicks in only if there is contention for the disk.
>
> Or we want max bandwidth control where a group is not allowed to exceed
> its limit even if the disk is free.
>
> Or we need both? I would think that at some point of time we will need
> both but we can start with proportional bandwidth control first.

How about making the throttling policy user selectable, like the IO
scheduler, and putting it in the higher layer? Then we could support
all of the policies (time-based, size-based and rate limiting). There
seems to be no single solution which satisfies all users. But I agree
with starting with proportional bandwidth control first.

BTW, I will start to reimplement dm-ioband in the block layer.

> Fairness for higher level logical devices
> =========================================
> Do we want good fairness numbers for higher level logical devices also
> or is it sufficient to provide fairness at leaf nodes? Providing fairness
> at leaf nodes can help us use the resources optimally and in the process
> we can get fairness at higher level also in many of the cases.

We should also take care of block devices which provide their own
make_request_fn() and do not use an IO scheduler. We can't use the leaf
nodes approach for such devices.

> But do we want strict fairness numbers on higher level logical devices
> even if it means sub-optimal usage of underlying physical devices?
>
> I think that for proportional bandwidth control, it should be ok to provide
> fairness at higher level logical device but for max bandwidth control it
> might make more sense to provide fairness at higher level. Consider a
> case where from a striped device a customer wants to limit a group to
> 30MB/s and in case of leaf node control, if every leaf node provides
> 30MB/s, it might accumulate to much more than specified rate at logical
> device.
>
> Latency Control and strong isolation between groups
> ===================================================
> Do we want good isolation between groups and better latencies?
>
> I think if problem is solved at IO scheduler level, we can achieve better
> latency control and hence stronger isolation between groups.
>
> Higher level solutions should find it hard to provide same kind of latency
> control and isolation between groups as IO scheduler based solution.

Why do you think it is hard for a higher level solution to provide
it? I think that it is a matter of how the throttling policy is
implemented.

> Fairness for buffered writes
> ============================
> Doing io control at any place below page cache has the disadvantage that page
> cache might not dispatch more writes from higher weight group, hence higher
> weight group might not see more IO done. Andrew says that we don't have
> a solution to this problem in kernel and he would like to see it handled
> properly.
>
> Only way to solve this seems to be to slow down the writers before they
> write into page cache. IO throttling patch handled it by slowing down
> writer if it crossed max specified rate. Other suggestions have come in
> the form of dirty_ratio per memory cgroup or a separate cgroup controller
> altogether where some kind of per group write limit can be specified.
>
> So if solution is implemented at IO scheduler layer or at device mapper
> layer, both shall have to rely on another controller to be co-mounted
> to handle buffered writes properly.
>
> Fairness within group
> =====================
> One of the issues with a higher level controller is how to do fair
> throttling so that fairness within group is not impacted, especially
> making sure that we don't break the notion of ioprio of the
> processes within group.

I ran your test script to confirm that the notion of ioprio was not
broken by dm-ioband. Here are the results of the test.
https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html

I think that the time period during which dm-ioband holds IO requests
for throttling would be too short to break the notion of ioprio.

> The io throttling patch in particular was very bad in terms of prio within
> group, where throttling treated everyone equally and the difference between
> process prio disappeared.
>
> Reads Vs Writes
> ===============
> A higher level control most likely will change the ratio in which reads
> and writes are dispatched to disk within group. It used to be decided
> by IO scheduler so far but with higher level groups doing throttling and
> possibly buffering the bios and releasing them later, they will have to
> come up with their own policy on in what proportion reads and writes
> should be dispatched. In case of IO scheduler based control, all the
> queuing takes place at IO scheduler and it still retains control of
> in what ratio reads and writes should be dispatched.

I don't think it is a concern. In the current implementation of
dm-ioband, sync and async IO requests are handled separately and, if
both are backlogged, the backlogged IOs are released in order of
arrival.

> Summary
> =======
>
> - An io scheduler based io controller can provide better latencies,
>   stronger isolation between groups, time based fairness and will not
>   interfere with io scheduler policies like class, ioprio and
>   reader vs writer issues.
>
>   But it cannot guarantee fairness at higher level logical devices.
>   Especially in case of max bw control, leaf node control does not sound
>   to be the most appropriate thing.
>
> - IO throttling provides max bw control in terms of absolute rate. It has
>   the advantage that it can provide control at higher level logical device
>   and also control buffered writes without need of additional controller
>   co-mounted.
>
>   But it does only max bw control and not proportional control so one might
>   not be using resources optimally. It loses sense of task prio and class
>   within group as any of the tasks can be throttled within group.
>   Because throttling does not kick in till you hit the max bw limit, it
>   should find it hard to provide same latencies as io scheduler based
>   control.
>
> - dm-ioband also has the advantage that it can provide fairness at higher
>   level logical devices.
>
>   But, fairness is provided only in terms of size of IO or number of IO.
>   No time based fairness. It is very throughput oriented and does not
>   throttle high speed group if other group is running slow random reader.
>   This results in bad latencies for random reader group and weaker
>   isolation between groups.

A new policy can be added to dm-ioband. Actually, the range-bw policy,
which provides min and max bandwidth control, does time-based
throttling. Moreover there is room for improvement in the existing
policies. The write-starve-read issue you pointed out will be solved
soon.

>   Also it does not provide fairness if a group is not continuously
>   backlogged. So if one is running 1-2 dd/sequential readers in the group,
>   one does not get fairness until workload is increased to a point where
>   group becomes continuously backlogged. This also results in poor
>   latencies and limited fairness.

This is intended to efficiently use the bandwidth of underlying devices
when IO load is low.

> At this point of time it does not look like a single IO controller covers
> all the scenarios/requirements. This means a few things to me.
>
> - Drop some of the requirements and go with one implementation which meets
>   that reduced set of requirements.
>
> - Have more than one IO controller implementation in kernel. One for lower
>   level control for better latencies, stronger isolation and optimal resource
>   usage and other one for fairness at higher level logical devices and max
>   bandwidth control.
>
>   And let user decide which one to use based on his/her needs.
>
> - Come up with more intelligent way of doing IO control where single
>   controller covers all the cases.
>
> At this point of time, I am more inclined towards option 2 of having more
> than one implementation in kernel. :-) (Until and unless we can brainstorm
> and come up with ideas to make option 3 happen).
>
> > It would be great if we discuss our plans on the mailing list, so we
> > can get early feedback from everyone.
>
> This is what comes to my mind so far. Please add to the list if I have missed
> some points. Also correct me if I am wrong about the pros/cons of the
> approaches.
>
> Thoughts/ideas/opinions are welcome...
>
> Thanks
> Vivek

Thanks,
Ryo Tsuruta
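
On the question above of whether overlap can be worked out from timestamps, one possible accounting is sketched below. It assumes each request carries hypothetical (group, start, end) timestamps and simply splits every busy interval evenly among the requests in flight. That even split is an assumption made for illustration, not something proposed in the thread, and it ignores that a seeky request may really cost the disk more than a sequential one it overlaps with.

```python
from collections import defaultdict


def charge_disk_time(requests):
    """requests: iterable of (group, start, end) timestamps.

    Returns {group: charged_time}, where each slice of busy time is
    divided equally among the requests in flight during that slice.
    """
    requests = list(requests)
    # Every start or end time is a point where the in-flight set can change.
    boundaries = sorted({t for _, start, end in requests for t in (start, end)})
    charged = defaultdict(float)
    for lo, hi in zip(boundaries, boundaries[1:]):
        # Requests that cover the whole slice [lo, hi].
        active = [g for g, start, end in requests if start <= lo and end >= hi]
        for group in active:
            charged[group] += (hi - lo) / len(active)
    return dict(charged)
```

With one request from group A in flight from t=0 to t=10 and one from group B from t=5 to t=10, A is charged 7.5 and B 2.5, which adds up to the 10 units the disk was busy.
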
* Re: IO scheduler based IO controller V10 2009-09-29 3:22 ` Vivek Goyal (?) (?) @ 2009-09-29 9:56 ` Ryo Tsuruta 2009-09-29 10:49 ` Takuya Yoshikawa ` (3 more replies) -1 siblings, 4 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-09-29 9:56 UTC (permalink / raw) To: vgoyal Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya Hi Vivek and all, Vivek Goyal <vgoyal@redhat.com> wrote: > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: > > We are starting from a point where there is no cgroup based IO > > scheduling in the kernel. And it is probably not reasonable to satisfy > > all IO scheduling related requirements in one patch set. We can start > > with something simple, and build on top of that. So a very simple > > patch set that enables cgroup based proportional scheduling for CFQ > > seems like the way to go at this point. > > Sure, we can start with CFQ only. But a bigger question we need to answer > is that is CFQ the right place to solve the issue? Jens, do you think > that CFQ is the right place to solve the problem? > > Andrew seems to favor a high level approach so that IO schedulers are less > complex and we can provide fairness at high level logical devices also. I'm not in favor of expansion of CFQ, because some enterprise storages are better performed with NOOP rather than CFQ, and I think bandwidth control is needed much more for such storage system. Is it easy to support other IO schedulers even if a new IO scheduler is introduced? I would like to know a bit more specific about Namuman's scheduler design. > I will again try to summarize my understanding so far about the pros/cons > of each approach and then we can take the discussion forward. Good summary. Thanks for your work. 
> Fairness in terms of size of IO or disk time used > ================================================= > On a seeky media, fairness in terms of disk time can get us better results > instead fairness interms of size of IO or number of IO. > > If we implement some kind of time based solution at higher layer, then > that higher layer should know who used how much of time each group used. We > can probably do some kind of timestamping in bio to get a sense when did it > get into disk and when did it finish. But on a multi queue hardware there > can be multiple requests in the disk either from same queue or from differnet > queues and with pure timestamping based apparoch, so far I could not think > how at high level we will get an idea who used how much of time. IIUC, could the overlap time be calculated from time-stamp on a multi queue hardware? > So this is the first point of contention that how do we want to provide > fairness. In terms of disk time used or in terms of size of IO/number of > IO. > > Max bandwidth Controller or Proportional bandwidth controller > ============================================================= > What is our primary requirement here? A weight based proportional > bandwidth controller where we can use the resources optimally and any > kind of throttling kicks in only if there is contention for the disk. > > Or we want max bandwidth control where a group is not allowed to use the > disk even if disk is free. > > Or we need both? I would think that at some point of time we will need > both but we can start with proportional bandwidth control first. How about making throttling policy be user selectable like the IO scheduler and putting it in the higher layer? So we could support all of policies (time-based, size-based and rate limiting). There seems not to only one solution which satisfies all users. But I agree with starting with proportional bandwidth control first. BTW, I will start to reimplement dm-ioband into block layer. 
> Fairness for higher level logical devices > ========================================= > Do we want good fairness numbers for higher level logical devices also > or it is sufficient to provide fairness at leaf nodes. Providing fairness > at leaf nodes can help us use the resources optimally and in the process > we can get fairness at higher level also in many of the cases. We should also take care of block devices which provide their own make_request_fn() and not use a IO scheduler. We can't use the leaf nodes approach to such devices. > But do we want strict fairness numbers on higher level logical devices > even if it means sub-optimal usage of unerlying phsical devices? > > I think that for proportinal bandwidth control, it should be ok to provide > fairness at higher level logical device but for max bandwidth control it > might make more sense to provide fairness at higher level. Consider a > case where from a striped device a customer wants to limit a group to > 30MB/s and in case of leaf node control, if every leaf node provides > 30MB/s, it might accumulate to much more than specified rate at logical > device. > > Latency Control and strong isolation between groups > =================================================== > Do we want a good isolation between groups and better latencies and > stronger isolation between groups? > > I think if problem is solved at IO scheduler level, we can achieve better > latency control and hence stronger isolation between groups. > > Higher level solutions should find it hard to provide same kind of latency > control and isolation between groups as IO scheduler based solution. Why do you think that the higher level solution is hard to provide it? I think that it is a matter of how to implement throttling policy. 
> Fairness for buffered writes > ============================ > Doing io control at any place below page cache has disadvantage that page > cache might not dispatch more writes from higher weight group hence higher > weight group might not see more IO done. Andrew says that we don't have > a solution to this problem in kernel and he would like to see it handled > properly. > > Only way to solve this seems to be to slow down the writers before they > write into page cache. IO throttling patch handled it by slowing down > writer if it crossed max specified rate. Other suggestions have come in > the form of dirty_ratio per memory cgroup or a separate cgroup controller > al-together where some kind of per group write limit can be specified. > > So if solution is implemented at IO scheduler layer or at device mapper > layer, both shall have to rely on another controller to be co-mounted > to handle buffered writes properly. > > Fairness with-in group > ====================== > One of the issues with higher level controller is that how to do fair > throttling so that fairness with-in group is not impacted. Especially > the case of making sure that we don't break the notion of ioprio of the > processes with-in group. I ran your test script to confirm that the notion of ioprio was not broken by dm-ioband. Here is the results of the test. https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html I think that the time period during which dm-ioband holds IO requests for throttling would be too short to break the notion of ioprio. > Especially io throttling patch was very bad in terms of prio with-in > group where throttling treated everyone equally and difference between > process prio disappeared. > > Reads Vs Writes > =============== > A higher level control most likely will change the ratio in which reads > and writes are dispatched to disk with-in group. 
It used to be decided > by IO scheduler so far but with higher level groups doing throttling and > possibly buffering the bios and releasing them later, they will have to > come up with their own policy on in what proportion reads and writes > should be dispatched. In case of IO scheduler based control, all the > queuing takes place at IO scheduler and it still retains control of > in what ration reads and writes should be dispatched. I don't think it is a concern. The current implementation of dm-ioband is that sync/async IO requests are handled separately and the backlogged IOs are released according to the order of arrival if both sync and async requests are backlogged. > Summary > ======= > > - An io scheduler based io controller can provide better latencies, > stronger isolation between groups, time based fairness and will not > interfere with io schedulers policies like class, ioprio and > reader vs writer issues. > > But it can gunrantee fairness at higher logical level devices. > Especially in case of max bw control, leaf node control does not sound > to be the most appropriate thing. > > - IO throttling provides max bw control in terms of absolute rate. It has > the advantage that it can provide control at higher level logical device > and also control buffered writes without need of additional controller > co-mounted. > > But it does only max bw control and not proportion control so one might > not be using resources optimally. It looses sense of task prio and class > with-in group as any of the task can be throttled with-in group. Because > throttling does not kick in till you hit the max bw limit, it should find > it hard to provide same latencies as io scheduler based control. > > - dm-ioband also has the advantage that it can provide fairness at higher > level logical devices. > > But, fairness is provided only in terms of size of IO or number of IO. > No time based fairness. 
It is very throughput oriented and does not > throttle high speed group if other group is running slow random reader. > This results in bad latnecies for random reader group and weaker > isolation between groups. A new policy can be added to dm-ioband. Actually, range-bw policy, which provides min and max bandwidth control, does time-based throttling. Moreover there is room for improvement for existing policies. The write-starve-read issue you pointed out will be solved soon. > Also it does not provide fairness if a group is not continuously > backlogged. So if one is running 1-2 dd/sequential readers in the group, > one does not get fairness until workload is increased to a point where > group becomes continuously backlogged. This also results in poor > latencies and limited fairness. This is intended to efficiently use bandwidth of underlying devices when IO load is low. > At this point of time it does not look like a single IO controller all > the scenarios/requirements. This means few things to me. > > - Drop some of the requirements and go with one implementation which meets > those reduced set of requirements. > > - Have more than one IO controller implementation in kenrel. One for lower > level control for better latencies, stronger isolation and optimal resource > usage and other one for fairness at higher level logical devices and max > bandwidth control. > > And let user decide which one to use based on his/her needs. > > - Come up with more intelligent way of doing IO control where single > controller covers all the cases. > > At this point of time, I am more inclined towards option 2 of having more > than one implementation in kernel. :-) (Until and unless we can brainstrom > and come up with ideas to make option 3 happen). > > > It would be great if we discuss our plans on the mailing list, so we > > can get early feedback from everyone. > > This is what comes to my mind so far. Please add to the list if I have missed > some points. 
Also correct me if I am wrong about the pros/cons of the > approaches. > > Thoughts/ideas/opinions are welcome... > > Thanks > Vivek Thanks, Ryo Tsuruta
* Re: IO scheduler based IO controller V10 2009-09-29 9:56 ` Ryo Tsuruta @ 2009-09-29 10:49 ` Takuya Yoshikawa 2009-09-29 14:10 ` Vivek Goyal ` (2 subsequent siblings) 3 siblings, 0 replies; 349+ messages in thread From: Takuya Yoshikawa @ 2009-09-29 10:49 UTC (permalink / raw) To: Ryo Tsuruta Cc: vgoyal, nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel Hi, Ryo Tsuruta wrote: > Hi Vivek and all, > > Vivek Goyal <vgoyal@redhat.com> wrote: >> On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: > >>> We are starting from a point where there is no cgroup based IO >>> scheduling in the kernel. And it is probably not reasonable to satisfy >>> all IO scheduling related requirements in one patch set. We can start >>> with something simple, and build on top of that. So a very simple >>> patch set that enables cgroup based proportional scheduling for CFQ >>> seems like the way to go at this point. >> Sure, we can start with CFQ only. But a bigger question we need to answer >> is that is CFQ the right place to solve the issue? Jens, do you think >> that CFQ is the right place to solve the problem? >> >> Andrew seems to favor a high level approach so that IO schedulers are less >> complex and we can provide fairness at high level logical devices also. > > I'm not in favor of expansion of CFQ, because some enterprise storages > are better performed with NOOP rather than CFQ, and I think bandwidth > control is needed much more for such storage system. Is it easy to > support other IO schedulers even if a new IO scheduler is introduced? > I would like to know a bit more specific about Namuman's scheduler design. Nauman said "cgroup based proportional scheduling for CFQ" and we need not expand much of CFQ itself, is it right Nauman? 
If so, we can reuse the io controller for new schedulers similar to CFQ.

I am not sure how important it is to consider which scheduler current
enterprise storage favors. If we introduce an io controller, the io
pattern reaching the disks will change, so there is no guarantee that
NOOP with some io controller will work better than CFQ with some io
controller. Of course, an io controller for NOOP may turn out better.

Thanks,
Takuya Yoshikawa

>
>> I will again try to summarize my understanding so far about the pros/cons
>> of each approach and then we can take the discussion forward.
>
> Good summary. Thanks for your work.
>
>> Fairness in terms of size of IO or disk time used
>> =================================================
>> On a seeky media, fairness in terms of disk time can get us better results
>> instead fairness interms of size of IO or number of IO.
>>
>> If we implement some kind of time based solution at higher layer, then
>> that higher layer should know who used how much of time each group used. We
>> can probably do some kind of timestamping in bio to get a sense when did it
>> get into disk and when did it finish. But on a multi queue hardware there
>> can be multiple requests in the disk either from same queue or from differnet
>> queues and with pure timestamping based apparoch, so far I could not think
>> how at high level we will get an idea who used how much of time.
>
> IIUC, could the overlap time be calculated from time-stamp on a multi
> queue hardware?
>
>> So this is the first point of contention that how do we want to provide
>> fairness. In terms of disk time used or in terms of size of IO/number of
>> IO.
>>
>> Max bandwidth Controller or Proportional bandwidth controller
>> =============================================================
>> What is our primary requirement here?
A weight based proportional >> bandwidth controller where we can use the resources optimally and any >> kind of throttling kicks in only if there is contention for the disk. >> >> Or we want max bandwidth control where a group is not allowed to use the >> disk even if disk is free. >> >> Or we need both? I would think that at some point of time we will need >> both but we can start with proportional bandwidth control first. > > How about making throttling policy be user selectable like the IO > scheduler and putting it in the higher layer? So we could support > all of policies (time-based, size-based and rate limiting). There > seems not to only one solution which satisfies all users. But I agree > with starting with proportional bandwidth control first. > > BTW, I will start to reimplement dm-ioband into block layer. > >> Fairness for higher level logical devices >> ========================================= >> Do we want good fairness numbers for higher level logical devices also >> or it is sufficient to provide fairness at leaf nodes. Providing fairness >> at leaf nodes can help us use the resources optimally and in the process >> we can get fairness at higher level also in many of the cases. > > We should also take care of block devices which provide their own > make_request_fn() and not use a IO scheduler. We can't use the leaf > nodes approach to such devices. > >> But do we want strict fairness numbers on higher level logical devices >> even if it means sub-optimal usage of unerlying phsical devices? >> >> I think that for proportinal bandwidth control, it should be ok to provide >> fairness at higher level logical device but for max bandwidth control it >> might make more sense to provide fairness at higher level. Consider a >> case where from a striped device a customer wants to limit a group to >> 30MB/s and in case of leaf node control, if every leaf node provides >> 30MB/s, it might accumulate to much more than specified rate at logical >> device. 
>> >> Latency Control and strong isolation between groups >> =================================================== >> Do we want a good isolation between groups and better latencies and >> stronger isolation between groups? >> >> I think if problem is solved at IO scheduler level, we can achieve better >> latency control and hence stronger isolation between groups. >> >> Higher level solutions should find it hard to provide same kind of latency >> control and isolation between groups as IO scheduler based solution. > > Why do you think that the higher level solution is hard to provide it? > I think that it is a matter of how to implement throttling policy. > >> Fairness for buffered writes >> ============================ >> Doing io control at any place below page cache has disadvantage that page >> cache might not dispatch more writes from higher weight group hence higher >> weight group might not see more IO done. Andrew says that we don't have >> a solution to this problem in kernel and he would like to see it handled >> properly. >> >> Only way to solve this seems to be to slow down the writers before they >> write into page cache. IO throttling patch handled it by slowing down >> writer if it crossed max specified rate. Other suggestions have come in >> the form of dirty_ratio per memory cgroup or a separate cgroup controller >> al-together where some kind of per group write limit can be specified. >> >> So if solution is implemented at IO scheduler layer or at device mapper >> layer, both shall have to rely on another controller to be co-mounted >> to handle buffered writes properly. >> >> Fairness with-in group >> ====================== >> One of the issues with higher level controller is that how to do fair >> throttling so that fairness with-in group is not impacted. Especially >> the case of making sure that we don't break the notion of ioprio of the >> processes with-in group. 
> > I ran your test script to confirm that the notion of ioprio was not > broken by dm-ioband. Here is the results of the test. > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html > > I think that the time period during which dm-ioband holds IO requests > for throttling would be too short to break the notion of ioprio. > >> Especially io throttling patch was very bad in terms of prio with-in >> group where throttling treated everyone equally and difference between >> process prio disappeared. >> >> Reads Vs Writes >> =============== >> A higher level control most likely will change the ratio in which reads >> and writes are dispatched to disk with-in group. It used to be decided >> by IO scheduler so far but with higher level groups doing throttling and >> possibly buffering the bios and releasing them later, they will have to >> come up with their own policy on in what proportion reads and writes >> should be dispatched. In case of IO scheduler based control, all the >> queuing takes place at IO scheduler and it still retains control of >> in what ration reads and writes should be dispatched. > > I don't think it is a concern. The current implementation of dm-ioband > is that sync/async IO requests are handled separately and the > backlogged IOs are released according to the order of arrival if both > sync and async requests are backlogged. > >> Summary >> ======= >> >> - An io scheduler based io controller can provide better latencies, >> stronger isolation between groups, time based fairness and will not >> interfere with io schedulers policies like class, ioprio and >> reader vs writer issues. >> >> But it can gunrantee fairness at higher logical level devices. >> Especially in case of max bw control, leaf node control does not sound >> to be the most appropriate thing. >> >> - IO throttling provides max bw control in terms of absolute rate. 
It has >> the advantage that it can provide control at higher level logical device >> and also control buffered writes without need of additional controller >> co-mounted. >> >> But it does only max bw control and not proportion control so one might >> not be using resources optimally. It looses sense of task prio and class >> with-in group as any of the task can be throttled with-in group. Because >> throttling does not kick in till you hit the max bw limit, it should find >> it hard to provide same latencies as io scheduler based control. >> >> - dm-ioband also has the advantage that it can provide fairness at higher >> level logical devices. >> >> But, fairness is provided only in terms of size of IO or number of IO. >> No time based fairness. It is very throughput oriented and does not >> throttle high speed group if other group is running slow random reader. >> This results in bad latnecies for random reader group and weaker >> isolation between groups. > > A new policy can be added to dm-ioband. Actually, range-bw policy, > which provides min and max bandwidth control, does time-based > throttling. Moreover there is room for improvement for existing > policies. The write-starve-read issue you pointed out will be solved > soon. > >> Also it does not provide fairness if a group is not continuously >> backlogged. So if one is running 1-2 dd/sequential readers in the group, >> one does not get fairness until workload is increased to a point where >> group becomes continuously backlogged. This also results in poor >> latencies and limited fairness. > > This is intended to efficiently use bandwidth of underlying devices > when IO load is low. > >> At this point of time it does not look like a single IO controller all >> the scenarios/requirements. This means few things to me. >> >> - Drop some of the requirements and go with one implementation which meets >> those reduced set of requirements. >> >> - Have more than one IO controller implementation in kenrel. 
One for lower >> level control for better latencies, stronger isolation and optimal resource >> usage and other one for fairness at higher level logical devices and max >> bandwidth control. >> >> And let user decide which one to use based on his/her needs. >> >> - Come up with more intelligent way of doing IO control where single >> controller covers all the cases. >> >> At this point of time, I am more inclined towards option 2 of having more >> than one implementation in kernel. :-) (Until and unless we can brainstrom >> and come up with ideas to make option 3 happen). >> >>> It would be great if we discuss our plans on the mailing list, so we >>> can get early feedback from everyone. >> >> This is what comes to my mind so far. Please add to the list if I have missed >> some points. Also correct me if I am wrong about the pros/cons of the >> approaches. >> >> Thoughts/ideas/opinions are welcome... >> >> Thanks >> Vivek > > Thanks, > Ryo Tsuruta >
* Re: IO scheduler based IO controller V10 2009-09-29 9:56 ` Ryo Tsuruta @ 2009-09-29 14:10 ` Vivek Goyal 2009-09-29 14:10 ` Vivek Goyal ` (2 subsequent siblings) 3 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-29 14:10 UTC (permalink / raw) To: Ryo Tsuruta Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote: > Hi Vivek and all, > > Vivek Goyal <vgoyal@redhat.com> wrote: > > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: > > > > We are starting from a point where there is no cgroup based IO > > > scheduling in the kernel. And it is probably not reasonable to satisfy > > > all IO scheduling related requirements in one patch set. We can start > > > with something simple, and build on top of that. So a very simple > > > patch set that enables cgroup based proportional scheduling for CFQ > > > seems like the way to go at this point. > > > > Sure, we can start with CFQ only. But a bigger question we need to answer > > is that is CFQ the right place to solve the issue? Jens, do you think > > that CFQ is the right place to solve the problem? > > > > Andrew seems to favor a high level approach so that IO schedulers are less > > complex and we can provide fairness at high level logical devices also. > > I'm not in favor of expansion of CFQ, because some enterprise storages > are better performed with NOOP rather than CFQ, and I think bandwidth > control is needed much more for such storage system. Is it easy to > support other IO schedulers even if a new IO scheduler is introduced? > I would like to know a bit more specific about Namuman's scheduler design. > The new design is essentially the old design. 
Except that the suggestion is to cover only CFQ in the first step, instead
of all the 4 IO schedulers, and to cover the others later. So providing
fairness for NOOP is not an issue. Even if we introduce new IO schedulers
down the line, I can't think of a reason why we can't cover those too with
the common layer.

> > I will again try to summarize my understanding so far about the pros/cons
> > of each approach and then we can take the discussion forward.
>
> Good summary. Thanks for your work.
>
> > Fairness in terms of size of IO or disk time used
> > =================================================
> > On a seeky media, fairness in terms of disk time can get us better results
> > instead fairness interms of size of IO or number of IO.
> >
> > If we implement some kind of time based solution at higher layer, then
> > that higher layer should know who used how much of time each group used. We
> > can probably do some kind of timestamping in bio to get a sense when did it
> > get into disk and when did it finish. But on a multi queue hardware there
> > can be multiple requests in the disk either from same queue or from differnet
> > queues and with pure timestamping based apparoch, so far I could not think
> > how at high level we will get an idea who used how much of time.
>
> IIUC, could the overlap time be calculated from time-stamp on a multi
> queue hardware?

So far I could not think of anything clean. Do you have something in mind?

I was thinking that the elevator layer will do the merging of bios. So the
IO scheduler/elevator can timestamp the first bio in the request as it goes
into the disk, and timestamp it again with the finish time once the request
finishes.

This way the higher layer can get an idea of how much disk time a group of
bios used. But on multi queue hardware, if we dispatch say 4 requests from
the same queue, then time accounting becomes an issue.
Consider the following, where four requests rq1, rq2, rq3 and rq4 are
dispatched to disk at times t0, t1, t2 and t3 respectively, and these
requests finish at times t4, t5, t6 and t7. For the sake of simplicity,
assume the time elapsed between consecutive milestones is t. Also assume
that all these requests are from the same queue/group.

        t0   t1   t2   t3   t4   t5   t6   t7
        rq1  rq2  rq3  rq4  rq1  rq2  rq3  rq4

Now the higher layer will think that the time consumed by the group is:

        (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t

But the time elapsed is only 7t.

Secondly, if a different group is running only a single sequential reader,
CFQ will be driving a queue depth of 1 there, so its time will not run
faster, and this inaccuracy in accounting will lead to an unfair share
between groups.

So we need something better to get a sense of which group used how much
disk time.

>
> > So this is the first point of contention that how do we want to provide
> > fairness. In terms of disk time used or in terms of size of IO/number of
> > IO.
> >
> > Max bandwidth Controller or Proportional bandwidth controller
> > =============================================================
> > What is our primary requirement here? A weight based proportional
> > bandwidth controller where we can use the resources optimally and any
> > kind of throttling kicks in only if there is contention for the disk.
> >
> > Or we want max bandwidth control where a group is not allowed to use the
> > disk even if disk is free.
> >
> > Or we need both? I would think that at some point of time we will need
> > both but we can start with proportional bandwidth control first.
>
> How about making throttling policy be user selectable like the IO
> scheduler and putting it in the higher layer? So we could support
> all of policies (time-based, size-based and rate limiting). There
> seems not to only one solution which satisfies all users. But I agree
> with starting with proportional bandwidth control first.
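
The accounting gap in the rq1..rq4 example above can be sketched roughly as
follows. This is only an illustration, assuming time is counted in integer
multiples of t and that requests are listed in dispatch order. The type
struct rq_span and the helpers naive_disk_time() and busy_disk_time() are
hypothetical names made up for this sketch; they are not part of CFQ, the
elevator layer or any posted patch.

```c
#include <assert.h>
#include <stddef.h>

/* Lifetime of one request on the device, in units of t (hypothetical type). */
struct rq_span {
	unsigned long dispatch;	/* time the request was sent to the disk */
	unsigned long finish;	/* time the request completed */
};

/* Per-request accounting: sum (finish - dispatch) over all requests. */
static unsigned long naive_disk_time(const struct rq_span *rq, size_t n)
{
	unsigned long total = 0;

	for (size_t i = 0; i < n; i++) {
		assert(rq[i].finish >= rq[i].dispatch);
		total += rq[i].finish - rq[i].dispatch;
	}
	return total;
}

/*
 * Wall-clock accounting: length of the union of the busy intervals.
 * Assumes rq[] is sorted by dispatch time, as in the example.
 */
static unsigned long busy_disk_time(const struct rq_span *rq, size_t n)
{
	unsigned long total = 0, start, end;

	if (n == 0)
		return 0;

	start = rq[0].dispatch;
	end = rq[0].finish;
	for (size_t i = 1; i < n; i++) {
		if (rq[i].dispatch > end) {
			/* Gap: close the current busy interval, open a new one. */
			total += end - start;
			start = rq[i].dispatch;
			end = rq[i].finish;
		} else if (rq[i].finish > end) {
			/* Overlap: extend the current busy interval. */
			end = rq[i].finish;
		}
	}
	return total + (end - start);
}
```

Feeding in the spans {0,4}, {1,5}, {2,6}, {3,7} gives the 16t per-request
sum versus the 7t the disk was actually busy. A group driving queue depth 1
sees both figures agree, which is why summing per-request times would
overcharge groups that keep several requests in flight.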
> What are the cases where time based policy does not work and size based policy works better and user would choose size based policy and not timed based one? I am not against implementing things in higher layer as long as we can ensure tight control on latencies, strong isolation between groups and not break CFQ's class and ioprio model with-in group. > BTW, I will start to reimplement dm-ioband into block layer. Can you elaborate little bit on this? > > > Fairness for higher level logical devices > > ========================================= > > Do we want good fairness numbers for higher level logical devices also > > or it is sufficient to provide fairness at leaf nodes. Providing fairness > > at leaf nodes can help us use the resources optimally and in the process > > we can get fairness at higher level also in many of the cases. > > We should also take care of block devices which provide their own > make_request_fn() and not use a IO scheduler. We can't use the leaf > nodes approach to such devices. > I am not sure how big an issue is this. This can be easily solved by making use of NOOP scheduler by these devices. What are the reasons for these devices to not use even noop? > > But do we want strict fairness numbers on higher level logical devices > > even if it means sub-optimal usage of unerlying phsical devices? > > > > I think that for proportinal bandwidth control, it should be ok to provide > > fairness at higher level logical device but for max bandwidth control it > > might make more sense to provide fairness at higher level. Consider a > > case where from a striped device a customer wants to limit a group to > > 30MB/s and in case of leaf node control, if every leaf node provides > > 30MB/s, it might accumulate to much more than specified rate at logical > > device. 
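
The striped-device concern quoted above is simple arithmetic, sketched here
under stated assumptions. The stripe width of 3 used in the note below is an
arbitrary assumption (the mail does not give one), and aggregate_limit() and
even_split_limit() are hypothetical helper names, not functions from any
posted controller.

```c
#include <assert.h>

/*
 * Ceiling seen at the logical device when every leaf node enforces the
 * full per-group limit on its own (rates in MB/s).
 */
static unsigned int aggregate_limit(unsigned int n_leaves,
				    unsigned int leaf_limit)
{
	return n_leaves * leaf_limit;
}

/*
 * Per-leaf limit if the logical limit were instead split evenly.
 * Assumption: IO is spread uniformly over the stripes; with a skewed
 * pattern the group would be held below its logical limit.
 */
static unsigned int even_split_limit(unsigned int n_leaves,
				     unsigned int logical_limit)
{
	assert(n_leaves > 0);
	return logical_limit / n_leaves;
}
```

With 3 leaves and a 30MB/s limit, leaf node control could let the group
reach 90MB/s at the logical device, while an even split would cap each leaf
at 10MB/s. Neither matches a strict 30MB/s limit at the logical device for
all IO patterns, which is the point the quoted paragraph makes.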
> >
> > Latency Control and strong isolation between groups
> > ===================================================
> > Do we want a good isolation between groups and better latencies and
> > stronger isolation between groups?
> >
> > I think if problem is solved at IO scheduler level, we can achieve better
> > latency control and hence stronger isolation between groups.
> >
> > Higher level solutions should find it hard to provide same kind of latency
> > control and isolation between groups as IO scheduler based solution.
>
> Why do you think that the higher level solution is hard to provide it?
> I think that it is a matter of how to implement throttling policy.
>

So far, in both dm-ioband and the IO throttling solution, I have seen that
the higher layer implements some kind of leaky bucket/token bucket
algorithm, which inherently allows IO from all the competing groups until
they run out of tokens, and then these groups are made to wait till fresh
tokens are issued.

That means that, most of the time, the IO scheduler will see requests from
more than one group at the same time, and that will be the source of weak
isolation between groups.

Consider the following simple example. Assume there are two groups; one
contains 16 random readers and the other contains 1 random reader.

        G1      G2
        16RR    1RR

Now it might happen that the IO scheduler sees requests from all 17 random
readers at the same time. (Throttling will probably kick in later, because
you would like to give one group a nice slice of 100ms; otherwise
sequential readers will suffer a lot and the disk will become seek bound.)
So CFQ will dispatch requests (at least one) from each of the 16 random
readers first and then from the 1 random reader in group 2. This increases
the max latency for the application in group 2 and provides weak isolation.

There will also be additional issues with the CFQ preemption logic. CFQ will
have no knowledge of groups and it will do cross group preemptions.
For example, if a metadata request comes in group1, it will preempt
whichever queue is being served in the other groups. So somebody doing
"find . *" or "cat <small files>" in one group will keep on preempting a
sequential reader in another group. Again, this will probably lead to
higher max latencies.

Note that even if CFQ does not enable idling on random readers and expires
the queue after a single dispatch, the seek time between queues can be
significant. Similarly, if instead of 16 random readers we had 16 random
synchronous writers, we would have the seek time issue, and writers can
also often dump bigger requests, which adds to latency.

This latency issue can be solved if we dispatch requests from only one
group for a certain period of time and then move to the next group
(something like what the common layer is doing).

If we go for only a single group dispatching requests at a time, then we
shall also have to implement some of the preemption semantics in the
higher layer, because in certain cases we want to do preemption across
groups, like an RT task group preempting a non-RT task group.

Once we go deeper into implementation, I think we will find more issues.

> > Fairness for buffered writes
> > ============================
> > Doing io control at any place below page cache has disadvantage that page
> > cache might not dispatch more writes from higher weight group hence higher
> > weight group might not see more IO done. Andrew says that we don't have
> > a solution to this problem in kernel and he would like to see it handled
> > properly.
> >
> > Only way to solve this seems to be to slow down the writers before they
> > write into page cache. IO throttling patch handled it by slowing down
> > writer if it crossed max specified rate. Other suggestions have come in
> > the form of dirty_ratio per memory cgroup or a separate cgroup controller
> > al-together where some kind of per group write limit can be specified.
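
To make the earlier token bucket argument in this reply concrete, here is a
minimal sketch of a generic per-group token bucket. It is not the dm-ioband
or io-throttle implementation; struct tb_group, tb_submit() and tb_refill()
are hypothetical names, and the token count of 32 in the note below is an
arbitrary assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Per-group throttling state for a generic token bucket (hypothetical). */
struct tb_group {
	unsigned int tokens;	/* tokens left in the current period */
	unsigned int refill;	/* tokens granted at each refill */
	unsigned int passed;	/* requests passed down to the IO scheduler */
	unsigned int held;	/* requests held back, waiting for tokens */
};

/* Submit one request: pass it down if a token is left, otherwise hold it. */
static bool tb_submit(struct tb_group *g)
{
	if (g->tokens == 0) {
		g->held++;
		return false;
	}
	g->tokens--;
	g->passed++;
	return true;
}

/* Issue fresh tokens; held requests are released first, in order. */
static void tb_refill(struct tb_group *g)
{
	g->tokens = g->refill;
	while (g->held > 0 && g->tokens > 0) {
		g->held--;
		g->tokens--;
		g->passed++;
	}
}
```

If G1 and G2 each start a period with 32 tokens, one request from each of
the 16 readers in G1 plus one from the reader in G2 all pass straight
through, so the IO scheduler below sees 17 requests from both groups at
once. Nothing is held back until a group exhausts its tokens, which is the
window in which isolation depends entirely on the scheduler underneath.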
> > > > So if solution is implemented at IO scheduler layer or at device mapper > > layer, both shall have to rely on another controller to be co-mounted > > to handle buffered writes properly. > > > > Fairness with-in group > > ====================== > > One of the issues with higher level controller is that how to do fair > > throttling so that fairness with-in group is not impacted. Especially > > the case of making sure that we don't break the notion of ioprio of the > > processes with-in group. > > I ran your test script to confirm that the notion of ioprio was not > broken by dm-ioband. Here is the results of the test. > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html > > I think that the time period during which dm-ioband holds IO requests > for throttling would be too short to break the notion of ioprio. Ok, I re-ran that test. Previously default io_limit value was 192 and now I set it up to 256 as you suggested. I still see writer starving reader. I have removed "conv=fdatasync" from writer so that a writer is pure buffered writes. 
With vanilla CFQ
----------------
reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s

with dm-ioband default io_limit=192
-----------------------------------
writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s
reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

with dm-ioband default io_limit=256
-----------------------------------
reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100
ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100

Notice that with vanilla CFQ the reader takes 10 seconds to finish, and
with dm-ioband it takes more than 40 seconds to finish. So the writer is
still starving the reader with both io_limit 192 and 256.

On top of that, can you please give some details on how increasing the
buffered queue length reduces the impact of writers?

IO Prio issue
--------------
I ran another test where two ioband devices of weight 100 each were
created on two partitions. In the first group 4 readers were launched.
Three readers are of class BE and prio 7; the fourth one is of class BE,
prio 0. In group2, I launched a buffered writer.

One would expect the prio0 reader to get more bandwidth than the prio 4
readers, and the prio 7 readers to get more or less the same bw. It looks
like that is not happening. Look how vanilla CFQ provides much more
bandwidth to the prio0 reader than to the prio7 readers, and how putting
them in the group reduces the difference between prio0 and prio7 readers.

Following are the results.
Vanilla CFQ
===========
set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s
578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s
578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s

set2
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s
578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s
578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s
578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s

with dm-ioband
==============
ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s
578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s
578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s
578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s

set2
----
prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s
578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s
578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s
578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s
578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s
578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s
578867200 bytes (579 MB) copied, 139.986 s, 4.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s

Note: In
vanilla CFQ, the prio0 reader got more than 350% of the BW of a prio 7
reader. With dm-ioband this ratio changed to less than 200%.

I will run more tests, but this shows how the notion of priority with-in a
group changes if we implement throttling at a higher layer and don't keep
it with CFQ.

The second thing which strikes me is that I divided the disk 50% each
between readers and writers, and in that case I would expect protection
for writers and expect writers to finish fast. But the writers have been
slowed down as well, and that also kills overall disk throughput. I think
it probably became seek bound.

The moment I get more time, I will run some timed fio tests and look at
how the overall disk performed and how bandwidth was distributed with-in a
group and between groups.

>
> > Especially io throttling patch was very bad in terms of prio with-in
> > group where throttling treated everyone equally and difference between
> > process prio disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control most likely will change the ratio in which reads
> > and writes are dispatched to disk with-in group. It used to be decided
> > by IO scheduler so far but with higher level groups doing throttling and
> > possibly buffering the bios and releasing them later, they will have to
> > come up with their own policy on in what proportion reads and writes
> > should be dispatched. In case of IO scheduler based control, all the
> > queuing takes place at IO scheduler and it still retains control of
> > in what ration reads and writes should be dispatched.
>
> I don't think it is a concern. The current implementation of dm-ioband
> is that sync/async IO requests are handled separately and the
> backlogged IOs are released according to the order of arrival if both
> sync and async requests are backlogged.

At least the version of dm-ioband I have is not producing the desired
results. See above.

Is there a newer version? I will run some tests on that too.
But I think you will again run into the same issue where you will decide the ratio of reads vs writes within a group, and as I change the IO scheduler, results will vary.

So at this point of time I can't think how you can solve the read vs write ratio issue at a higher layer without changing the behavior of the underlying IO scheduler.

>
> > Summary
> > =======
> >
> > - An io scheduler based io controller can provide better latencies,
> > stronger isolation between groups, time based fairness and will not
> > interfere with io schedulers policies like class, ioprio and
> > reader vs writer issues.
> >
> > But it can't guarantee fairness at higher logical level devices.
> > Especially in case of max bw control, leaf node control does not sound
> > to be the most appropriate thing.
> >
> > - IO throttling provides max bw control in terms of absolute rate. It has
> > the advantage that it can provide control at higher level logical device
> > and also control buffered writes without need of additional controller
> > co-mounted.
> >
> > But it does only max bw control and not proportion control so one might
> > not be using resources optimally. It loses sense of task prio and class
> > within group as any of the task can be throttled within group. Because
> > throttling does not kick in till you hit the max bw limit, it should find
> > it hard to provide same latencies as io scheduler based control.
> >
> > - dm-ioband also has the advantage that it can provide fairness at higher
> > level logical devices.
> >
> > But, fairness is provided only in terms of size of IO or number of IO.
> > No time based fairness. It is very throughput oriented and does not
> > throttle high speed group if other group is running slow random reader.
> > This results in bad latencies for random reader group and weaker
> > isolation between groups.
>
> A new policy can be added to dm-ioband. Actually, range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling.
> Moreover there is room for improvement for existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
>
> > Also it does not provide fairness if a group is not continuously
> > backlogged. So if one is running 1-2 dd/sequential readers in the group,
> > one does not get fairness until workload is increased to a point where
> > group becomes continuously backlogged. This also results in poor
> > latencies and limited fairness.
>
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.

But this has the following undesired results.

- A slow moving group does not get reduced latencies. For example, random readers in a slow moving group get no isolation and will continue to see higher max latencies.

- A single sequential reader in one group does not get its fair share, and we might be pushing buffered writes in the other group thinking that we are getting better throughput. But the fact is that we are eating away the reader's share in group1 and giving it to the writers in group2. Also I showed that we did not necessarily improve the overall throughput of the system by doing so. (Because it increases the number of seeks.)

I had sent you a mail to show that.

http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html

But you changed the test case to run 4 readers in a single group to show that throughput does not decrease. Please don't change test cases. In case of 4 sequential readers in the group, the group is continuously backlogged and you don't steal bandwidth from the slow moving group. So in that mail I was not even discussing the scenario where you don't steal bandwidth from the other group.
I specifically created one slow moving group with one reader so that we end up stealing bandwidth from the slow moving group, and showed that we did not achieve higher overall throughput by stealing the BW. At the same time we did not get fairness for the single reader, and observed decreasing throughput for the single reader as the number of writers in the other group increased.

Thanks
Vivek

>
> > At this point of time it does not look like a single IO controller covers all
> > the scenarios/requirements. This means few things to me.
> >
> > - Drop some of the requirements and go with one implementation which meets
> > those reduced set of requirements.
> >
> > - Have more than one IO controller implementation in kernel. One for lower
> > level control for better latencies, stronger isolation and optimal resource
> > usage and other one for fairness at higher level logical devices and max
> > bandwidth control.
> >
> > And let user decide which one to use based on his/her needs.
> >
> > - Come up with more intelligent way of doing IO control where single
> > controller covers all the cases.
> >
> > At this point of time, I am more inclined towards option 2 of having more
> > than one implementation in kernel. :-) (Until and unless we can brainstorm
> > and come up with ideas to make option 3 happen).
> >
> > > It would be great if we discuss our plans on the mailing list, so we
> > > can get early feedback from everyone.
> >
> > This is what comes to my mind so far. Please add to the list if I have missed
> > some points. Also correct me if I am wrong about the pros/cons of the
> > approaches.
> >
> > Thoughts/ideas/opinions are welcome...
> >
> > Thanks
> > Vivek
>
> Thanks,
> Ryo Tsuruta
* Re: IO scheduler based IO controller V10
From: Vivek Goyal @ 2009-09-29 14:10 UTC
To: Ryo Tsuruta
Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir, paolo.valente, jmarchan, guijianfeng, fernando, mikew, yoshikawa.takuya, jmoyer, nauman, mingo, m-ikeda, riel, lizf, fchecconi, s-uchida, containers, linux-kernel, akpm, righi.andrea, torvalds

On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and all,
>
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote:
>
> > > We are starting from a point where there is no cgroup based IO
> > > scheduling in the kernel. And it is probably not reasonable to satisfy
> > > all IO scheduling related requirements in one patch set. We can start
> > > with something simple, and build on top of that. So a very simple
> > > patch set that enables cgroup based proportional scheduling for CFQ
> > > seems like the way to go at this point.
> >
> > Sure, we can start with CFQ only. But a bigger question we need to answer
> > is that is CFQ the right place to solve the issue? Jens, do you think
> > that CFQ is the right place to solve the problem?
> >
> > Andrew seems to favor a high level approach so that IO schedulers are less
> > complex and we can provide fairness at high level logical devices also.
>
> I'm not in favor of expansion of CFQ, because some enterprise storages
> are better performed with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage system. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specific about Nauman's scheduler design.
>

The new design is essentially the old design, except that the suggestion is to cover only CFQ in the first step, instead of all 4 IO schedulers, and to cover the others later.
So providing fairness for NOOP is not an issue. Even if we introduce new IO schedulers down the line, I can't think of a reason why we can't cover those too with the common layer.

> > I will again try to summarize my understanding so far about the pros/cons
> > of each approach and then we can take the discussion forward.
>
> Good summary. Thanks for your work.
>
> > Fairness in terms of size of IO or disk time used
> > =================================================
> > On a seeky media, fairness in terms of disk time can get us better results
> > instead fairness in terms of size of IO or number of IO.
> >
> > If we implement some kind of time based solution at higher layer, then
> > that higher layer should know how much of time each group used. We
> > can probably do some kind of timestamping in bio to get a sense when did it
> > get into disk and when did it finish. But on a multi queue hardware there
> > can be multiple requests in the disk either from same queue or from different
> > queues and with pure timestamping based approach, so far I could not think
> > how at high level we will get an idea who used how much of time.
>
> IIUC, could the overlap time be calculated from time-stamp on a multi
> queue hardware?

So far I could not think of anything clean. Do you have something in mind?

I was thinking that the elevator layer will do the merging of bios. So the IO scheduler/elevator can timestamp the first bio in the request as it goes into the disk, and timestamp it again with the finish time once the request finishes. This way the higher layer can get an idea of how much disk time a group of bios used. But on multi queue hardware, if we dispatch say 4 requests from the same queue, then time accounting becomes an issue.

Consider the following, where four requests rq1, rq2, rq3 and rq4 are dispatched to disk at times t0, t1, t2 and t3 respectively and these requests finish at times t4, t5, t6 and t7. For the sake of simplicity assume the time elapsed between each pair of consecutive milestones is t.
Also assume that all these requests are from the same queue/group.

t0   t1   t2   t3   t4   t5   t6   t7
rq1  rq2  rq3  rq4  rq1  rq2  rq3  rq4

Now the higher layer will think that the time consumed by the group is:

(t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t

But the time elapsed is only 7t.

Secondly, if a different group is running only a single sequential reader, CFQ will be driving a queue depth of 1 there, so time will not appear to run faster for that group, and this inaccuracy in accounting will lead to an unfair share between groups.

So we need something better to get a sense of which group used how much disk time.

>
> > So this is the first point of contention that how do we want to provide
> > fairness. In terms of disk time used or in terms of size of IO/number of
> > IO.
> >
> > Max bandwidth Controller or Proportional bandwidth controller
> > =============================================================
> > What is our primary requirement here? A weight based proportional
> > bandwidth controller where we can use the resources optimally and any
> > kind of throttling kicks in only if there is contention for the disk.
> >
> > Or we want max bandwidth control where a group is not allowed to use the
> > disk even if disk is free.
> >
> > Or we need both? I would think that at some point of time we will need
> > both but we can start with proportional bandwidth control first.
>
> How about making throttling policy be user selectable like the IO
> scheduler and putting it in the higher layer? So we could support
> all of policies (time-based, size-based and rate limiting). There
> seems not to be only one solution which satisfies all users. But I agree
> with starting with proportional bandwidth control first.
>

What are the cases where a time based policy does not work and a size based policy works better, so that the user would choose the size based policy and not the time based one?
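
Going back to the rq1 to rq4 accounting example above, here is a minimal Python sketch of the two numbers. It compares the naive per-request sum with the time the disk was actually busy (the union of the dispatch/finish intervals). The function names and the choice t = 1 are only assumptions for illustration; nothing here comes from the patches.

```python
def naive_disk_time(requests):
    """Sum of (finish - dispatch) per request; overlapping time is counted repeatedly."""
    return sum(finish - dispatch for dispatch, finish in requests)


def busy_time(requests):
    """Length of the union of the [dispatch, finish] intervals."""
    total = 0
    cur_start = cur_end = None
    for dispatch, finish in sorted(requests):
        if cur_end is None or dispatch > cur_end:
            # Gap: close the previous busy period and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = dispatch, finish
        else:
            # Overlap: extend the current busy period.
            cur_end = max(cur_end, finish)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


t = 1  # one time unit between consecutive milestones t0..t7
# rq1..rq4 are dispatched at t0..t3 and complete at t4..t7
requests = [(i * t, (i + 4) * t) for i in range(4)]

print(naive_disk_time(requests))  # 16, i.e. 16t
print(busy_time(requests))        # 7, i.e. 7t
```

Note that even the union only says how long the disk was busy. It does not say how to split that time between groups whose requests overlap, which is the part that is still open.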
I am not against implementing things in a higher layer as long as we can ensure tight control on latencies, strong isolation between groups, and do not break CFQ's class and ioprio model within a group.

> BTW, I will start to reimplement dm-ioband into block layer.

Can you elaborate a little bit on this?

>
> > Fairness for higher level logical devices
> > =========================================
> > Do we want good fairness numbers for higher level logical devices also
> > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > at leaf nodes can help us use the resources optimally and in the process
> > we can get fairness at higher level also in many of the cases.
>
> We should also take care of block devices which provide their own
> make_request_fn() and not use a IO scheduler. We can't use the leaf
> nodes approach to such devices.
>

I am not sure how big an issue this is. This can be easily solved by having these devices make use of the NOOP scheduler. What are the reasons for these devices to not use even noop?

> > But do we want strict fairness numbers on higher level logical devices
> > even if it means sub-optimal usage of underlying physical devices?
> >
> > I think that for proportional bandwidth control, it should be ok to provide
> > fairness at higher level logical device but for max bandwidth control it
> > might make more sense to provide fairness at higher level. Consider a
> > case where from a striped device a customer wants to limit a group to
> > 30MB/s and in case of leaf node control, if every leaf node provides
> > 30MB/s, it might accumulate to much more than specified rate at logical
> > device.
> >
> > Latency Control and strong isolation between groups
> > ===================================================
> > Do we want a good isolation between groups and better latencies and
> > stronger isolation between groups?
> >
> > I think if problem is solved at IO scheduler level, we can achieve better
> > latency control and hence stronger isolation between groups.
> >
> > Higher level solutions should find it hard to provide same kind of latency
> > control and isolation between groups as IO scheduler based solution.
>
> Why do you think that the higher level solution is hard to provide it?
> I think that it is a matter of how to implement throttling policy.
>

So far, both in dm-ioband and in the IO throttling solution, I have seen that the higher layer implements some kind of leaky bucket/token bucket algorithm, which inherently allows IO from all the competing groups until they run out of tokens, and then these groups are made to wait till fresh tokens are issued. That means that, most of the time, the IO scheduler will see requests from more than one group at the same time, and that will be the source of weak isolation between groups.

Consider the following simple example. Assume there are two groups, one containing 16 random readers and the other containing 1 random reader.

G1      G2
16RR    1RR

Now it might happen that the IO scheduler sees requests from all 17 random readers at the same time. (Throttling probably will kick in later, because you would like to give one group a nice slice of 100ms, otherwise sequential readers will suffer a lot and the disk will become seek bound.) So CFQ will dispatch requests (at least one) from each of the 16 random readers first and then from the 1 random reader in group 2. This increases the max latency for the application in group 2 and provides weak isolation.

There will also be additional issues with the CFQ preemption logic. CFQ will have no knowledge of groups and it will do cross group preemptions. For example, if a meta data request comes in group1, it will preempt any of the queues being served in other groups. So somebody doing "find . *" or "cat <small files>" in one group will keep on preempting a sequential reader in the other group.
Again this will probably lead to higher max latencies.

Note, even if CFQ does not enable idling on random readers, and expires the queue after a single dispatch, seek time between queues can be significant. Similarly, if instead of 16 random readers we had 16 random synchronous writers, we will have the seek time issue as well, and writers can often dump bigger requests, which also adds to latency.

This latency issue can be solved if we dispatch requests only from one group for a certain period of time and then move to the next group. (Something like what the common layer is doing.)

If we go for only a single group dispatching requests, then we shall have to implement some of the preemption semantics also in the higher layer, because in certain cases we want to do preemption across groups, like an RT task group preempting a non-RT task group etc.

Once we go deeper into implementation, I think we will find more issues.

> > Fairness for buffered writes
> > ============================
> > Doing io control at any place below page cache has disadvantage that page
> > cache might not dispatch more writes from higher weight group hence higher
> > weight group might not see more IO done. Andrew says that we don't have
> > a solution to this problem in kernel and he would like to see it handled
> > properly.
> >
> > Only way to solve this seems to be to slow down the writers before they
> > write into page cache. IO throttling patch handled it by slowing down
> > writer if it crossed max specified rate. Other suggestions have come in
> > the form of dirty_ratio per memory cgroup or a separate cgroup controller
> > altogether where some kind of per group write limit can be specified.
> >
> > So if solution is implemented at IO scheduler layer or at device mapper
> > layer, both shall have to rely on another controller to be co-mounted
> > to handle buffered writes properly.
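
As a rough illustration of the G1/G2 isolation example discussed earlier in this mail (16 random readers vs 1 random reader), the following Python sketch counts how many dispatches the single reader in G2 waits behind. It compares a flat round robin over all queues with groups taking turns. This is a toy model with assumed names: one request per queue, no idling, no seek cost and no time slices, so it is not how CFQ or the common layer actually dispatch.

```python
def flat_round_robin(groups):
    """Dispatch order when the scheduler sees every queue and has no notion of groups."""
    return [(g, q) for g, nqueues in groups.items() for q in range(nqueues)]


def group_round_robin(groups):
    """Dispatch order when groups take turns and each turn serves one queue of the group."""
    order = []
    pending = {g: list(range(n)) for g, n in groups.items()}
    while any(pending.values()):
        for g, queues in pending.items():
            if queues:
                order.append((g, queues.pop(0)))
    return order


def dispatches_before(order, group):
    """Number of requests dispatched before the first request of `group`."""
    return next(i for i, (g, _) in enumerate(order) if g == group)


groups = {"G1": 16, "G2": 1}  # 16 random readers vs 1 random reader

print(dispatches_before(flat_round_robin(groups), "G2"))   # 16
print(dispatches_before(group_round_robin(groups), "G2"))  # 1
```

In the flat case the G2 reader's wait grows with the number of readers in G1; with group turns it does not, which is the isolation property being argued for.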
> >
> > Fairness within group
> > =====================
> > One of the issues with higher level controller is that how to do fair
> > throttling so that fairness within group is not impacted. Especially
> > the case of making sure that we don't break the notion of ioprio of the
> > processes within group.
>
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here are the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
>
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.

Ok, I re-ran that test. Previously the default io_limit value was 192 and now I set it to 256 as you suggested. I still see the writer starving the reader. I have removed "conv=fdatasync" from the writer so that the writer does pure buffered writes.

With vanilla CFQ
----------------
reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s

with dm-ioband default io_limit=192
-----------------------------------
writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s
reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

with dm-ioband default io_limit=256
-----------------------------------
reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100
ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100

Notice that with vanilla CFQ the reader takes about 10 seconds to finish, and with dm-ioband it takes more than 40 seconds. So the writer is still starving the reader with both io_limit 192 and 256.
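
For reference, the reader slowdown can be worked out directly from the dd numbers above. A small Python sketch, assuming dd's decimal units (1 MB = 10^6 bytes); the dictionary keys are just labels chosen here:

```python
READER_BYTES = 578867200

# Reader completion times in seconds, copied from the dd output above.
reader_seconds = {
    "vanilla CFQ": 10.803,
    "dm-ioband io_limit=192": 52.1419,
    "dm-ioband io_limit=256": 42.6231,
}

baseline = reader_seconds["vanilla CFQ"]
for name, secs in reader_seconds.items():
    mb_per_s = READER_BYTES / secs / 1e6  # throughput in decimal MB/s
    slowdown = secs / baseline            # completion time relative to vanilla CFQ
    print(f"{name}: {mb_per_s:.1f} MB/s, {slowdown:.1f}x the vanilla CFQ time")
```

With these inputs the reader takes roughly 4.8 times as long with io_limit=192 and roughly 3.9 times as long with io_limit=256 as it does under vanilla CFQ.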
On top of that, can you please give some details on how increasing the buffered queue length reduces the impact of writers?

IO Prio issue
-------------
I ran another test where two ioband devices of weight 100 each were created on two partitions. In the first group 4 readers were launched. Three readers are of class BE and prio 7, the fourth one is of class BE prio 0. In group2, I launched a buffered writer.

One would expect that the prio0 reader gets more bandwidth as compared to the prio 7 readers, and that the prio 7 readers will get more or less the same bw. Looks like that is not happening. Look how vanilla CFQ provides much more bandwidth to the prio0 reader as compared to the prio7 readers, and how putting them in the group reduces the difference between prio0 and prio7 readers.

Following are the results.

Vanilla CFQ
===========
set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s
578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s
578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s

set2
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s
578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s
578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s
578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s

with dm-ioband
==============
ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s
578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s
578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s
578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s

set2
----
prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s
578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s
578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s
578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s
578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s
578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s
578867200 bytes (579 MB) copied, 139.986 s, 4.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s

Note: In vanilla CFQ, prio0 reader got more than 350% BW of prio 7 reader. With dm-ioband this ratio changed to less than 200%.

I will run more tests, but this shows how the notion of priority within a group changes if we implement throttling at a higher layer and don't keep it with CFQ.

The second thing which strikes me is that I divided the disk 50% each between readers and writers, and in that case I would expect protection for the writers and expect them to finish fast. But the writers have been slowed down considerably, and it also kills overall disk throughput. I think it probably became seek bound.

Once I get more time, I will run some timed fio tests and look at how the overall disk performed and how bandwidth was distributed within a group and between groups.

>
> > Especially io throttling patch was very bad in terms of prio within
> > group where throttling treated everyone equally and difference between
> > process prio disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control most likely will change the ratio in which reads
> > and writes are dispatched to disk within group.
> > It used to be decided
> > by IO scheduler so far but with higher level groups doing throttling and
> > possibly buffering the bios and releasing them later, they will have to
> > come up with their own policy on in what proportion reads and writes
> > should be dispatched. In case of IO scheduler based control, all the
> > queuing takes place at IO scheduler and it still retains control of
> > in what ratio reads and writes should be dispatched.
>
> I don't think it is a concern. The current implementation of dm-ioband
> is that sync/async IO requests are handled separately and the
> backlogged IOs are released according to the order of arrival if both
> sync and async requests are backlogged.

At least the version of dm-ioband I have is not producing the desired results. See above.

Is there a newer version? I will run some tests on that too. But I think you will again run into the same issue where you will decide the ratio of reads vs writes within a group, and as I change the IO scheduler, results will vary.

So at this point of time I can't think how you can solve the read vs write ratio issue at a higher layer without changing the behavior of the underlying IO scheduler.

>
> > Summary
> > =======
> >
> > - An io scheduler based io controller can provide better latencies,
> > stronger isolation between groups, time based fairness and will not
> > interfere with io schedulers policies like class, ioprio and
> > reader vs writer issues.
> >
> > But it can't guarantee fairness at higher logical level devices.
> > Especially in case of max bw control, leaf node control does not sound
> > to be the most appropriate thing.
> >
> > - IO throttling provides max bw control in terms of absolute rate. It has
> > the advantage that it can provide control at higher level logical device
> > and also control buffered writes without need of additional controller
> > co-mounted.
> >
> > But it does only max bw control and not proportion control so one might
> > not be using resources optimally.
> > It loses sense of task prio and class
> > within group as any of the task can be throttled within group. Because
> > throttling does not kick in till you hit the max bw limit, it should find
> > it hard to provide same latencies as io scheduler based control.
> >
> > - dm-ioband also has the advantage that it can provide fairness at higher
> > level logical devices.
> >
> > But, fairness is provided only in terms of size of IO or number of IO.
> > No time based fairness. It is very throughput oriented and does not
> > throttle high speed group if other group is running slow random reader.
> > This results in bad latencies for random reader group and weaker
> > isolation between groups.
>
> A new policy can be added to dm-ioband. Actually, range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling. Moreover there is room for improvement for existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
>
> > Also it does not provide fairness if a group is not continuously
> > backlogged. So if one is running 1-2 dd/sequential readers in the group,
> > one does not get fairness until workload is increased to a point where
> > group becomes continuously backlogged. This also results in poor
> > latencies and limited fairness.
>
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.

But this has the following undesired results.

- A slow moving group does not get reduced latencies. For example, random readers in a slow moving group get no isolation and will continue to see higher max latencies.

- A single sequential reader in one group does not get its fair share, and we might be pushing buffered writes in the other group thinking that we are getting better throughput. But the fact is that we are eating away the reader's share in group1 and giving it to the writers in group2. Also I showed that we did not necessarily improve the overall throughput of the system by doing so.
(Because it increases the number of seeks.)

I had sent you a mail to show that.

http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html

But you changed the test case to run 4 readers in a single group to show that throughput does not decrease. Please don't change test cases. In case of 4 sequential readers in the group, the group is continuously backlogged and you don't steal bandwidth from the slow moving group. So in that mail I was not even discussing the scenario where you don't steal bandwidth from the other group.

I specifically created one slow moving group with one reader so that we end up stealing bandwidth from the slow moving group, and showed that we did not achieve higher overall throughput by stealing the BW. At the same time we did not get fairness for the single reader, and observed decreasing throughput for the single reader as the number of writers in the other group increased.

Thanks
Vivek

>
> > At this point of time it does not look like a single IO controller covers all
> > the scenarios/requirements. This means few things to me.
> >
> > - Drop some of the requirements and go with one implementation which meets
> > those reduced set of requirements.
> >
> > - Have more than one IO controller implementation in kernel. One for lower
> > level control for better latencies, stronger isolation and optimal resource
> > usage and other one for fairness at higher level logical devices and max
> > bandwidth control.
> >
> > And let user decide which one to use based on his/her needs.
> >
> > - Come up with more intelligent way of doing IO control where single
> > controller covers all the cases.
> >
> > At this point of time, I am more inclined towards option 2 of having more
> > than one implementation in kernel. :-) (Until and unless we can brainstorm
> > and come up with ideas to make option 3 happen).
> >
> > > It would be great if we discuss our plans on the mailing list, so we
> > > can get early feedback from everyone.
> >
> > This is what comes to my mind so far. Please add to the list if I have missed
> > some points. Also correct me if I am wrong about the pros/cons of the
> > approaches.
> >
> > Thoughts/ideas/opinions are welcome...
> >
> > Thanks
> > Vivek
>
> Thanks,
> Ryo Tsuruta
* Re: IO scheduler based IO controller V10
From: Nauman Rafique @ 2009-09-29 19:53 UTC
To: Vivek Goyal
Cc: Ryo Tsuruta, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

We have been going around in circles for the past many months on this issue of IO controller. I thought that we were getting closer to a point where we agree on one approach and go with it, but apparently we are not. I think it would be useful at this point to learn from the example of how similar functionality was introduced for other resources like cpu scheduling and memory controllers.

We are starting from a point where there is no cgroup based resource allocation for disks and there is a lot to be done. CFS has been doing hierarchical proportional allocation for CPU scheduling for a while now. Only recently someone has sent out patches for enforcing upper limits. And it makes a lot of sense (more discussion on this later). Also Fernando tells me that the memory controller did not support hierarchies in the first attempt. What I don't understand is, if we are starting from scratch, why do we want to solve all the problems of IO scheduling in one attempt?

Max bandwidth Controller or Proportional bandwidth controller
=============================================================

Enforcing limits is applicable in the scenario where you are managing a bunch of services in a data center and you want to either charge them for what they use or you want a very predictable performance over time. If we just do proportional allocation, then the actual performance received by a user depends on other co-scheduled tasks. If other tasks are not using the resource, you end up using their share.
But if all the other co-users become active, the 'extra' resource that you had would be taken away. Thus without enforcing some upper limit, predictability gets hurt. But this becomes an issue only if we are sharing resources.

The most important precondition to sharing resources is 'the requirement to provide isolation'. And isolation includes controlling both bandwidth AND latency, in the presence of other sharers. As Vivek has rightly pointed out, a ticket allocation based algorithm is good for enforcing upper limits, but it is NOT good for providing isolation, i.e. latency control and even bandwidth in some cases (as Vivek has shown with results in the last few emails). Moreover, a solution that is implemented in higher layers (be it VFS or DM) has little control over what happens in the IO scheduler, again hurting the isolation goal.

In the absence of isolation, we cannot even start sharing a resource. Predictability and billing are secondary concerns that arise only if we are sharing resources. If there is somebody who does not care about isolation, but wants to do their billing correctly, I would like to know about it. Needless to say, max bandwidth limits can also be enforced at the IO scheduling layer.

Common layer vs CFQ
===================

Takuya has raised an interesting point here. If somebody wishes to use noop, using a common layer IO controller on top of noop isn't necessarily going to give them the same thing. In fact, with the IO controller, noop might behave much like CFQ.

Moreover, if at some point we decide that we absolutely need the IO controller to work for other schedulers too, we have Vivek's patch set as a proof-of-concept. For now, as Jens very rightly pointed out in our discussion, we can have a "simple scheduler: Noop" and an "intelligent scheduler: CFQ with cgroup based scheduling".
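
The predictability argument from the max bandwidth section above can be illustrated with a toy calculation in Python. Under pure proportional allocation a group's bandwidth depends on who else is active, while an upper limit bounds the best case. The weights, the 100 MB/s disk and the helper name `shares` are assumptions made up for this sketch, not anything from the patch sets.

```python
def shares(weights, active, capacity, max_bw=None):
    """Split `capacity` among the active groups in proportion to their weights.

    If max_bw is given, a group never gets more than its cap. In this simple
    model, bandwidth left over by a capped group is not redistributed.
    """
    total_weight = sum(weights[g] for g in active)
    result = {}
    for g in active:
        bw = capacity * weights[g] / total_weight
        if max_bw and g in max_bw:
            bw = min(bw, max_bw[g])
        result[g] = bw
    return result


weights = {"A": 100, "B": 100, "C": 200}
capacity = 100.0  # MB/s, hypothetical disk

# Proportional only: group A alone gets the whole disk,
# but drops to a quarter of it once B and C become active.
print(shares(weights, ["A"], capacity))            # {'A': 100.0}
print(shares(weights, ["A", "B", "C"], capacity))  # {'A': 25.0, 'B': 25.0, 'C': 50.0}

# With a 25 MB/s cap on A, A sees the same bandwidth in both situations.
print(shares(weights, ["A"], capacity, max_bw={"A": 25.0}))  # {'A': 25.0}
```

The cap buys predictability at the cost of leaving the disk idle when A runs alone, which is the trade-off between the two controller types.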
Class based scheduling =================== CFQ has this notion of classes that needs to be supported in any solution that we come up with, otherwise we break the semantics of the existing scheduler. We have workloads which have strong latency requirements. We have two options: either don't do resource sharing for them OR share the resource but put them in a higher class (RT) so that their latencies are not (or minimally) affected by other workloads running with them. A solution in higher layer can try to support those semantics, but what if somebody wants to use a Noop scheduler and does not care about those semantics? We will end up with multiple schedulers in the upper layers, and who knows where all this will stop. Controlling writeback ================ It seems like writeback path has problems, but we should not try to solve those problems with the same patch set that is trying to do basic cgroup based IO scheduling. Jens' patches for per-bdi pdflush are already in. They should solve the problem of pdflush not sending down enough IOs; at least Jens' results seem to show that. IMHO, the next step is to use memory controller in conjunction with IO controller, and per group, per bdi pdflush threads (only if a group is doing IO on that bdi), something similar to io_group that we have in Vivek's patches. That should solve multiple problems. First, it would allow us to obviate the need of any tracking for dirty pages. Second, we can build a feedback from IO scheduling layer to the upper layers. If the number of pending writes in IO controller for a given group exceeds a limit, we block the submitting thread (pdflush), similar to current congestion implementation. Then the group would start hitting dirty limits at one point (we would need per group dirty limits, as has already been pointed out by others), thus blocking the tasks that are dirtying the pages. Thus using a block layer IO controller, we can achieve an effect similar to that achieved by Righi's proposal.
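The feedback idea in the writeback section (block the submitting thread once a group's pending writes reach a limit) can be sketched roughly as follows. This is only an illustration in Python, not kernel code; the class name GroupWriteQueue, its methods and the limit value are all hypothetical and do not come from any patch set.

```python
class GroupWriteQueue:
    """Hypothetical per-group counter of writes pending in the IO controller."""

    def __init__(self, limit):
        self.limit = limit      # assumed per-group limit on pending writes
        self.pending = 0        # writes submitted but not yet completed

    def is_congested(self):
        # The group counts as congested once pending writes reach the limit.
        return self.pending >= self.limit

    def submit(self):
        """Try to queue one write; False means the submitter should block."""
        if self.is_congested():
            return False
        self.pending += 1
        return True

    def complete(self):
        """Called when one write finishes, making room for a blocked submitter."""
        if self.pending > 0:
            self.pending -= 1


queue = GroupWriteQueue(limit=2)
results = [queue.submit(), queue.submit(), queue.submit()]
queue.complete()
after_completion = queue.submit()
print(results, after_completion)
```

In this toy model the third submission is refused because the group is at its limit, and a submission succeeds again once one write completes. A real implementation would put the submitting thread to sleep rather than return a flag.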
Vivek has summarized most of the other arguments very well. In short, what I am trying to say is let's start with something very simple that satisfies some of the most important requirements and we can build upon that. On Tue, Sep 29, 2009 at 7:10 AM, Vivek Goyal <vgoyal@redhat.com> wrote: > On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote: >> Hi Vivek and all, >> >> Vivek Goyal <vgoyal@redhat.com> wrote: >> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: >> >> > > We are starting from a point where there is no cgroup based IO >> > > scheduling in the kernel. And it is probably not reasonable to satisfy >> > > all IO scheduling related requirements in one patch set. We can start >> > > with something simple, and build on top of that. So a very simple >> > > patch set that enables cgroup based proportional scheduling for CFQ >> > > seems like the way to go at this point. >> > >> > Sure, we can start with CFQ only. But a bigger question we need to answer >> > is whether CFQ is the right place to solve the issue. Jens, do you think >> > that CFQ is the right place to solve the problem? >> > >> > Andrew seems to favor a high level approach so that IO schedulers are less >> > complex and we can provide fairness at high level logical devices also. >> >> I'm not in favor of expansion of CFQ, because some enterprise storage >> performs better with NOOP than with CFQ, and I think bandwidth >> control is needed much more for such storage systems. Is it easy to >> support other IO schedulers even if a new IO scheduler is introduced? >> I would like to know a bit more specifics about Nauman's scheduler design. >> > > The new design is essentially the old design. Except for the fact that the > suggestion is that in the first step instead of covering all the 4 IO > schedulers, first cover only CFQ and then later others. > > So providing fairness for NOOP is not an issue.
Even if we introduce new > IO schedulers down the line, I can't think of a reason why we can't cover > that too with common layer. > >> > I will again try to summarize my understanding so far about the pros/cons >> > of each approach and then we can take the discussion forward. >> >> Good summary. Thanks for your work. >> >> > Fairness in terms of size of IO or disk time used >> > ================================================= >> > On a seeky media, fairness in terms of disk time can get us better results >> > than fairness in terms of size of IO or number of IO. >> > >> > If we implement some kind of time based solution at higher layer, then >> > that higher layer should know how much time each group used. We >> > can probably do some kind of timestamping in bio to get a sense when did it >> > get into disk and when did it finish. But on a multi queue hardware there >> > can be multiple requests in the disk either from same queue or from different >> > queues and with pure timestamping based approach, so far I could not think >> > how at high level we will get an idea who used how much of time. >> >> IIUC, could the overlap time be calculated from time-stamp on a multi >> queue hardware? > > So far could not think of anything clean. Do you have something in mind? > > I was thinking that elevator layer will do the merge of bios. So IO > scheduler/elevator can time stamp the first bio in the request as it goes > into the disk and again timestamp with finish time once request finishes. > > This way higher layer can get an idea how much disk time a group of bios > used. But on multi queue, if we dispatch say 4 requests from same queue, > then time accounting becomes an issue. > > Consider following where four requests rq1, rq2, rq3 and rq4 are > dispatched to disk at time t0, t1, t2 and t3 respectively and these > requests finish at time t4, t5, t6 and t7. For sake of simplicity assume > time elapsed between each of milestones is t.
Also assume that all these > requests are from same queue/group. > > t0 t1 t2 t3 t4 t5 t6 t7 > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4 > > Now higher layer will think that time consumed by group is: > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t > > But the time elapsed is only 7t. > > Secondly if a different group is running only single sequential reader, > there CFQ will be driving queue depth of 1 and time will not be running > faster and this inaccuracy in accounting will lead to unfair share between > groups. > > So we need something better to get a sense which group used how much of > disk time. > >> >> > So this is the first point of contention that how do we want to provide >> > fairness. In terms of disk time used or in terms of size of IO/number of >> > IO. >> > >> > Max bandwidth Controller or Proportional bandwidth controller >> > ============================================================= >> > What is our primary requirement here? A weight based proportional >> > bandwidth controller where we can use the resources optimally and any >> > kind of throttling kicks in only if there is contention for the disk. >> > >> > Or we want max bandwidth control where a group is not allowed to use the >> > disk even if disk is free. >> > >> > Or we need both? I would think that at some point of time we will need >> > both but we can start with proportional bandwidth control first. >> >> How about making throttling policy be user selectable like the IO >> scheduler and putting it in the higher layer? So we could support >> all of policies (time-based, size-based and rate limiting). There >> seems not to only one solution which satisfies all users. But I agree >> with starting with proportional bandwidth control first. >> > > What are the cases where time based policy does not work and size based > policy works better and user would choose size based policy and not timed > based one? 
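The rq1..rq4 accounting example quoted above can be checked with a small sketch. This is only an illustration in Python, not kernel code; the function names are hypothetical, and t is taken as 1 time unit so that t0..t7 become 0..7.

```python
def summed_service_time(requests):
    """Naive per-request accounting: add up (finish - dispatch) for every request."""
    return sum(finish - dispatch for dispatch, finish in requests)


def busy_time(requests):
    """Wall-clock time with at least one request in flight (union of intervals)."""
    total = 0
    current_start = current_end = None
    for dispatch, finish in sorted(requests):
        if current_end is None or dispatch > current_end:
            # A gap: close the previous busy interval and start a new one.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = dispatch, finish
        else:
            # Overlaps the current busy interval: extend it if needed.
            current_end = max(current_end, finish)
    if current_end is not None:
        total += current_end - current_start
    return total


# rq1..rq4 dispatched at t0..t3 and finished at t4..t7, with t = 1.
overlapping = [(0, 4), (1, 5), (2, 6), (3, 7)]
print(summed_service_time(overlapping), busy_time(overlapping))

# A queue depth of 1 (one request at a time) has no overlap, so both agree.
serial = [(0, 1), (1, 2)]
print(summed_service_time(serial), busy_time(serial))
```

With overlapping requests the naive sum charges the group 16t while only 7t of wall-clock time passed. At queue depth 1 the two measures agree, which is the asymmetry between groups that the quoted text points out.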
> > I am not against implementing things in higher layer as long as we can > ensure tight control on latencies, strong isolation between groups and > not break CFQ's class and ioprio model with-in group. > >> BTW, I will start to reimplement dm-ioband into block layer. > > Can you elaborate little bit on this? > >> >> > Fairness for higher level logical devices >> > ========================================= >> > Do we want good fairness numbers for higher level logical devices also >> > or it is sufficient to provide fairness at leaf nodes. Providing fairness >> > at leaf nodes can help us use the resources optimally and in the process >> > we can get fairness at higher level also in many of the cases. >> >> We should also take care of block devices which provide their own >> make_request_fn() and do not use an IO scheduler. We can't use the leaf >> nodes approach to such devices. >> > > I am not sure how big an issue this is. This can be easily solved by > making use of NOOP scheduler by these devices. What are the reasons for > these devices to not use even noop? > >> > But do we want strict fairness numbers on higher level logical devices >> > even if it means sub-optimal usage of underlying physical devices? >> > >> > I think that for proportional bandwidth control, it should be ok to provide >> > fairness at leaf nodes but for max bandwidth control it >> > might make more sense to provide fairness at higher level. Consider a >> > case where from a striped device a customer wants to limit a group to >> > 30MB/s and in case of leaf node control, if every leaf node provides >> > 30MB/s, it might accumulate to much more than specified rate at logical >> > device. >> > >> > Latency Control and strong isolation between groups >> > =================================================== >> > Do we want a good isolation between groups and better latencies and >> > stronger isolation between groups?
>> > >> > I think if problem is solved at IO scheduler level, we can achieve better >> > latency control and hence stronger isolation between groups. >> > >> > Higher level solutions should find it hard to provide same kind of latency >> > control and isolation between groups as IO scheduler based solution. >> >> Why do you think that the higher level solution is hard to provide it? >> I think that it is a matter of how to implement throttling policy. >> > > So far both in dm-ioband and IO throttling solution I have seen that > higher layer implements some kind of leaky bucket/token bucket algorithm, > which inherently allows IO from all the competing groups until they run > out of tokens and then these groups are made to wait till fresh tokens are > issued. > > That means, most of the time, IO scheduler will see requests from more > than one group at the same time and that will be the source of weak > isolation between groups. > > Consider the following simple example. Assume there are two groups and one > contains 16 random readers and other contains 1 random reader. > > G1 G2 > 16RR 1RR > > Now it might happen that IO scheduler sees requests from all the 17 RR > readers at the same time. (Throttling probably will kick in later because > you would like to give one group a nice slice of 100ms otherwise > sequential readers will suffer a lot and disk will become seek bound). > > So CFQ will dispatch requests (at least one), from each of the 16 random > readers first and then from 1 random reader in group 2 and this increases > the max latency for the application in group 2 and provides weak > isolation. > > There will also be additional issues with CFQ preemption logic. CFQ will > have no knowledge of groups and it will do cross group preemptions. For > example if a meta data request comes in group1, it will preempt any of > the queues being served in other groups. So somebody doing "find .
*" or > "cat <small files>" in one group will keep on preempting a sequential > reader in other group. Again this will probably lead to higher max > latencies. > > Note, even if CFQ does not enable idling on random readers, and expires > queue after single dispatch, seeking time between queues can be > significant. Similarly, if instead of 16 random readers we had 16 random > synchronous writers we will have seek time issue as well as writers can > often dump bigger requests which also adds to latency. > > This latency issue can be solved if we dispatch requests only from one > group for a certain period of time and then move to next group. (Something > like what the common layer is doing). > > If we go for only single group dispatching requests, then we shall have > to implement some of the preemption semantics also in higher layer because > in certain cases we want to do preemption across the groups. Like RT task > group preempting non-RT task group etc. > > Once we go deeper into implementation, I think we will find more issues. > >> > Fairness for buffered writes >> > ============================ >> > Doing io control at any place below page cache has disadvantage that page >> > cache might not dispatch more writes from higher weight group hence higher >> > weight group might not see more IO done. Andrew says that we don't have >> > a solution to this problem in kernel and he would like to see it handled >> > properly. >> > >> > Only way to solve this seems to be to slow down the writers before they >> > write into page cache. IO throttling patch handled it by slowing down >> > writer if it crossed max specified rate. Other suggestions have come in >> > the form of dirty_ratio per memory cgroup or a separate cgroup controller >> > altogether where some kind of per group write limit can be specified.
>> > >> > So if solution is implemented at IO scheduler layer or at device mapper >> > layer, both shall have to rely on another controller to be co-mounted >> > to handle buffered writes properly. >> > >> > Fairness with-in group >> > ====================== >> > One of the issues with higher level controller is that how to do fair >> > throttling so that fairness with-in group is not impacted. Especially >> > the case of making sure that we don't break the notion of ioprio of the >> > processes with-in group. >> >> I ran your test script to confirm that the notion of ioprio was not >> broken by dm-ioband. Here is the results of the test. >> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html >> >> I think that the time period during which dm-ioband holds IO requests >> for throttling would be too short to break the notion of ioprio. > > Ok, I re-ran that test. Previously default io_limit value was 192 and now > I set it up to 256 as you suggested. I still see writer starving reader. I > have removed "conv=fdatasync" from writer so that a writer is pure buffered > writes. 
> > With vanilla CFQ > ---------------- > reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s > > with dm-ioband default io_limit=192 > ----------------------------------- > writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s > reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s > > ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100 > ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100 > > with dm-ioband default io_limit=256 > ----------------------------------- > reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s > > ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100 > ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100 > > Notice that with vanilla CFQ, reader is taking 10 seconds to finish and > with dm-ioband it takes more than 40 seconds to finish. So writer is still > starving the reader with both io_limit 192 and 256. > > On top of that can you please give some details how increasing the > buffered queue length reduces the impact of writers? > > IO Prio issue > -------------- > I ran another test where two ioband devices were created of weight 100 > each on two partitions. In first group 4 readers were launched. Three > readers are of class BE and prio 7, fourth one is of class BE prio 0. In > group2, I launched a buffered writer. > > One would expect that prio0 reader gets more bandwidth as compared to > prio 4 readers and prio 7 readers will get more or less same bw. Looks like > that is not happening. Look how vanilla CFQ provides much more bandwidth > to prio0 reader as compared to prio7 reader and how putting them in the > group reduces the difference between prio0 and prio7 readers. > > Following are the results.
> > Vanilla CFQ > =========== > set1 > ---- > prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s > 578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s > 578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s > 578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s > > set2 > ---- > prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s > 578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s > 578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s > 578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s > > set3 > ---- > prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s > 578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s > 578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s > 578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s > > with dm-ioband > ============== > ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100 > ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100 > > set1 > ---- > prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s > 578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s > 578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s > 578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s > > set2 > --- > prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s > 578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s > 578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s > 578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s > > set3 > ---- > prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s > 578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s > 578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s > 578867200 bytes (579 
MB) copied, 139.986 s, 4.1 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s > > Note: In vanilla CFQ, prio0 reader got more than 350% BW of prio 7 reader. > With dm-ioband this ratio changed to less than 200%. > > I will run more tests, but this show how notion of priority with-in a > group changes if we implement throttling at higher layer and don't > keep it with CFQ. > > The second thing which strikes me is that I divided the disk 50% each > between readers and writers and in that case would expect protection > for writers and expect writers to finish fast. But writers have been > slowed down like and it also kills overall disk throughput. I think > it probably became seek bound. > > I think the moment I get more time, I will run some timed fio tests > and look at how overall disk performed and how bandwidth was > distributed with-in group and between groups. > >> >> > Especially io throttling patch was very bad in terms of prio with-in >> > group where throttling treated everyone equally and difference between >> > process prio disappeared. >> > >> > Reads Vs Writes >> > =============== >> > A higher level control most likely will change the ratio in which reads >> > and writes are dispatched to disk with-in group. It used to be decided >> > by IO scheduler so far but with higher level groups doing throttling and >> > possibly buffering the bios and releasing them later, they will have to >> > come up with their own policy on in what proportion reads and writes >> > should be dispatched. In case of IO scheduler based control, all the >> > queuing takes place at IO scheduler and it still retains control of >> > in what ration reads and writes should be dispatched. >> >> I don't think it is a concern. The current implementation of dm-ioband >> is that sync/async IO requests are handled separately and the >> backlogged IOs are released according to the order of arrival if both >> sync and async requests are backlogged. 
> > At least the version of dm-ioband I have is not producing the desired > results. See above. > > Is there a newer version? I will run some tests on that too. But I think > you will again run into same issue where you will decide the ratio of > read vs write with-in group and as I change the IO schedulers results > will vary. > > So at this point of time I can't think how you can solve read vs write > ratio issue at higher layer without changing the behavior of the underlying > IO scheduler. > >> >> > Summary >> > ======= >> > >> > - An io scheduler based io controller can provide better latencies, >> > stronger isolation between groups, time based fairness and will not >> > interfere with io scheduler's policies like class, ioprio and >> > reader vs writer issues. >> > >> > But it cannot guarantee fairness at higher logical level devices. >> > Especially in case of max bw control, leaf node control does not sound >> > to be the most appropriate thing. >> > >> > - IO throttling provides max bw control in terms of absolute rate. It has >> > the advantage that it can provide control at higher level logical device >> > and also control buffered writes without need of additional controller >> > co-mounted. >> > >> > But it does only max bw control and not proportion control so one might >> > not be using resources optimally. It loses sense of task prio and class >> > with-in group as any of the tasks can be throttled with-in group. Because >> > throttling does not kick in till you hit the max bw limit, it should find >> > it hard to provide same latencies as io scheduler based control. >> > >> > - dm-ioband also has the advantage that it can provide fairness at higher >> > level logical devices. >> > >> > But, fairness is provided only in terms of size of IO or number of IO. >> > No time based fairness. It is very throughput oriented and does not >> > throttle high speed group if other group is running slow random reader.
>> > This results in bad latencies for random reader group and weaker >> > isolation between groups. >> >> A new policy can be added to dm-ioband. Actually, range-bw policy, >> which provides min and max bandwidth control, does time-based >> throttling. Moreover there is room for improvement for existing >> policies. The write-starve-read issue you pointed out will be solved >> soon. >> >> > Also it does not provide fairness if a group is not continuously >> > backlogged. So if one is running 1-2 dd/sequential readers in the group, >> > one does not get fairness until workload is increased to a point where >> > group becomes continuously backlogged. This also results in poor >> > latencies and limited fairness. >> >> This is intended to efficiently use bandwidth of underlying devices >> when IO load is low. > > But this has following undesired results. > > - Slow moving group does not get reduced latencies. For example, random readers > in slow moving group get no isolation and will continue to see higher max > latencies. > > - A single sequential reader in one group does not get fair share and > we might be pushing buffered writes in other group thinking that we > are getting better throughput. But the fact is that we are eating away > readers' share in group1 and giving it to writers in group2. Also I > showed that we did not necessarily improve the overall throughput of > the system by doing so. (Because it increases the number of seeks). > > I had sent you a mail to show that. > > http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html > > But you changed the test case to run 4 readers in a single group to show that > throughput does not decrease. Please don't change test cases. In case of 4 > sequential readers in the group, group is continuously backlogged and you > don't steal bandwidth from slow moving group.
So in that mail I was not > even discussing the scenario when you don't steal the bandwidth from > other group. > > I specially created one slow moving group with one reader so that we end up > stealing bandwidth from slow moving group and show that we did not achieve > higher overall throughput by stealing the BW; at the same time we did not get > fairness for single reader and observed decreasing throughput for single > reader as number of writers in other group increased. > > Thanks > Vivek > >> >> > At this point of time it does not look like a single IO controller covers all >> > the scenarios/requirements. This means few things to me. >> > >> > - Drop some of the requirements and go with one implementation which meets >> > that reduced set of requirements. >> > >> > - Have more than one IO controller implementation in kernel. One for lower >> > level control for better latencies, stronger isolation and optimal resource >> > usage and other one for fairness at higher level logical devices and max >> > bandwidth control. >> > >> > And let user decide which one to use based on his/her needs. >> > >> > - Come up with more intelligent way of doing IO control where single >> > controller covers all the cases. >> > >> > At this point of time, I am more inclined towards option 2 of having more >> > than one implementation in kernel. :-) (Until and unless we can brainstorm >> > and come up with ideas to make option 3 happen). >> > >> > > It would be great if we discuss our plans on the mailing list, so we >> > > can get early feedback from everyone. >> > >> > This is what comes to my mind so far. Please add to the list if I have missed >> > some points. Also correct me if I am wrong about the pros/cons of the >> > approaches. >> > >> > Thoughts/ideas/opinions are welcome... >> > >> > Thanks >> > Vivek >> >> Thanks, >> Ryo Tsuruta > ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20090929141049.GA12141-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20090929141049.GA12141-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-09-29 19:53 ` Nauman Rafique 2009-09-30 8:43 ` Ryo Tsuruta 1 sibling, 0 replies; 349+ messages in thread From: Nauman Rafique @ 2009-09-29 19:53 UTC (permalink / raw) To: Vivek Goyal Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b We have been going around in circles for past many months on this issue of IO controller. I thought that we are getting closer to a point where we agree on one approach and go with it, but apparently we are not. I think it would be useful at this point to learn from the example of how similar functionality was introduced for other resources like cpu scheduling and memory controllers. We are starting from a point where there is no cgroup based resource allocation for disks and there is a lot to be done. CFS has been doing hierarchical proportional allocation for CPU scheduling for a while now. Only recently someone has sent out patches for enforcing upper limits. And it makes a lot of sense (more discussion on this later). Also Fernando tells me that memory controller did not support hierarchies in the first attempt. What I don't understand is, if we are starting from scratch, why do we want to solve all the problems of IO scheduling in one attempt? 
Max bandwidth Controller or Proportional bandwidth controller =============================================== Enforcing limits is applicable in the scenario where you are managing a bunch of services in a data center and you want to either charge them for what they use or you want a very predictable performance over time. If we just do proportional allocation, then the actual performance received by a user depends on other co-scheduled tasks. If other tasks are not using the resource, you end up using their share. But if all the other co-users become active, the 'extra' resource that you had would be taken away. Thus without enforcing some upper limit, predictability gets hurt. But this becomes an issue only if we are sharing resources. The most important precondition to sharing resources is 'the requirement to provide isolation'. And isolation includes controlling both bandwidth AND latency, in the presence of other sharers. As Vivek has rightly pointed out, a ticket allocation based algorithm is good for enforcing upper limits, but it is NOT good for providing isolation i.e. latency control and even bandwidth in some cases (as Vivek has shown with results in the last few emails). Moreover, a solution that is implemented in higher layers (be it VFS or DM) has little control over what happens in IO scheduler, again hurting the isolation goal. In the absence of isolation, we cannot even start sharing a resource. The predictability or billing are secondary concerns that arise only if we are sharing resources. If there is somebody who does not care about isolation, but want to do their billing correctly, I would like to know about it. Needless to say that max bandwidth limits can also be enforced at IO scheduling layer. Common layer vs CFS ================== Takuya has raised an interesting point here. If somebody wishes to use noop, using a common layer IO controller on top of noop isn't necessarily going to give them the same thing. 
In fact, with IO controller, noop might behave much like CFQ. Moreover at one point, if we decide that we absolutely need IO controller to work for other schedulers too, we have this Vivek's patch set as a proof-of-concept. For now, as Jens very rightly pointed out in our discussion, we can have a "simple scheduler: Noop" and an "intelligent scheduler: CFQ with cgroup based scheduling". Class based scheduling =================== CFQ has this notion of classes that needs to be supported in any solution that we come up with, otherwise we break the semantics of the existing scheduler. We have workloads which have strong latency requirements. We have two options: either don't do resource sharing for them OR share the resource but put them in a higher class (RT) so that their latencies are not (or minimally) effected by other workloads running with them. A solution in higher layer can try to support those semantics, but what if somebody wants to use a Noop scheduler and does not care about those semantics? We will end up with multiple schedulers in the upper layers, and who knows where all this will stop. Controlling writeback ================ It seems like writeback path has problems, but we should not try to solve those problems with the same patch set that is trying to do basic cgroup based IO scheduling. Jens patches for per-bdi pdflush are already in. They should solve the problem of pdflush not sending down enough IOs; at least Jens results seem to show that. IMHO, the next step is to use memory controller in conjunction with IO controller, and a per group per bdi pdflush threads (only if a group is doing IO on that bdi), something similar to io_group that we have in Vivek's patches. That should solve multiple problems. First, it would allow us to obviate the need of any tracking for dirty pages. Second, we can build a feedback from IO scheduling layer to the upper layers. 
If the number of pending writes in the IO controller for a given group exceeds a limit, we block the submitting thread (pdflush), similar to the current congestion implementation. Then the group would start hitting dirty limits at some point (we would need per group dirty limits, as has already been pointed out by others), thus blocking the tasks that are dirtying the pages. Thus using a block layer IO controller, we can achieve an effect similar to that achieved by Righi's proposal. Vivek has summarized most of the other arguments very well. In short, what I am trying to say is let's start with something very simple that satisfies some of the most important requirements and we can build upon that. On Tue, Sep 29, 2009 at 7:10 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote: >> Hi Vivek and all, >> >> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: >> >> > > We are starting from a point where there is no cgroup based IO >> > > scheduling in the kernel. And it is probably not reasonable to satisfy >> > > all IO scheduling related requirements in one patch set. We can start >> > > with something simple, and build on top of that. So a very simple >> > > patch set that enables cgroup based proportional scheduling for CFQ >> > > seems like the way to go at this point. >> > >> > Sure, we can start with CFQ only. But a bigger question we need to answer >> > is whether CFQ is the right place to solve the issue. Jens, do you think >> > that CFQ is the right place to solve the problem? >> > >> > Andrew seems to favor a high level approach so that IO schedulers are less >> > complex and we can provide fairness at high level logical devices also.
>> >> I'm not in favor of expansion of CFQ, because some enterprise storage >> performs better with NOOP rather than CFQ, and I think bandwidth >> control is needed much more for such storage systems. Is it easy to >> support other IO schedulers even if a new IO scheduler is introduced? >> I would like to know a bit more specifically about Nauman's scheduler design. >> > > The new design is essentially the old design. Except the fact that > suggestion is that in the first step instead of covering all the 4 IO > schedulers, first cover only CFQ and then later others. > > So providing fairness for NOOP is not an issue. Even if we introduce new > IO schedulers down the line, I can't think of a reason why we can't cover > that too with common layer. > >> > I will again try to summarize my understanding so far about the pros/cons >> > of each approach and then we can take the discussion forward. >> >> Good summary. Thanks for your work. >> >> > Fairness in terms of size of IO or disk time used >> > ================================================= >> > On a seeky media, fairness in terms of disk time can get us better results >> > instead of fairness in terms of size of IO or number of IO. >> > >> > If we implement some kind of time based solution at higher layer, then >> > that higher layer should know how much time each group used. We >> > can probably do some kind of timestamping in bio to get a sense when did it >> > get into disk and when did it finish. But on a multi queue hardware there >> > can be multiple requests in the disk either from same queue or from different >> > queues and with pure timestamping based approach, so far I could not think >> > how at high level we will get an idea who used how much of time. >> >> IIUC, could the overlap time be calculated from time-stamp on a multi >> queue hardware? > > So far could not think of anything clean. Do you have something in mind? > > I was thinking that elevator layer will do the merge of bios.
So IO > scheduler/elevator can time stamp the first bio in the request as it goes > into the disk and again timestamp with finish time once request finishes. > > This way higher layer can get an idea how much disk time a group of bios > used. But on multi queue, if we dispatch say 4 requests from same queue, > then time accounting becomes an issue. > > Consider following where four requests rq1, rq2, rq3 and rq4 are > dispatched to disk at time t0, t1, t2 and t3 respectively and these > requests finish at time t4, t5, t6 and t7. For sake of simplicity assume > time elapsed between each of milestones is t. Also assume that all these > requests are from same queue/group. > > t0 t1 t2 t3 t4 t5 t6 t7 > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4 > > Now higher layer will think that time consumed by group is: > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t > > But the time elapsed is only 7t. > > Secondly if a different group is running only single sequential reader, > there CFQ will be driving queue depth of 1 and time will not be running > faster and this inaccuracy in accounting will lead to unfair share between > groups. > > So we need something better to get a sense which group used how much of > disk time. > >> >> > So this is the first point of contention that how do we want to provide >> > fairness. In terms of disk time used or in terms of size of IO/number of >> > IO. >> > >> > Max bandwidth Controller or Proportional bandwidth controller >> > ============================================================= >> > What is our primary requirement here? A weight based proportional >> > bandwidth controller where we can use the resources optimally and any >> > kind of throttling kicks in only if there is contention for the disk. >> > >> > Or we want max bandwidth control where a group is not allowed to use the >> > disk even if disk is free. >> > >> > Or we need both?
I would think that at some point of time we will need >> > both but we can start with proportional bandwidth control first. >> >> How about making throttling policy be user selectable like the IO >> scheduler and putting it in the higher layer? So we could support >> all of policies (time-based, size-based and rate limiting). There >> seems not to be only one solution which satisfies all users. But I agree >> with starting with proportional bandwidth control first. >> > > What are the cases where time based policy does not work and size based > policy works better and user would choose size based policy and not time > based one? > > I am not against implementing things in higher layer as long as we can > ensure tight control on latencies, strong isolation between groups and > not break CFQ's class and ioprio model with-in group. > >> BTW, I will start to reimplement dm-ioband into block layer. > > Can you elaborate little bit on this? > >> >> > Fairness for higher level logical devices >> > ========================================= >> > Do we want good fairness numbers for higher level logical devices also >> > or it is sufficient to provide fairness at leaf nodes. Providing fairness >> > at leaf nodes can help us use the resources optimally and in the process >> > we can get fairness at higher level also in many of the cases. >> >> We should also take care of block devices which provide their own >> make_request_fn() and not use an IO scheduler. We can't use the leaf >> nodes approach to such devices. >> > > I am not sure how big an issue is this. This can be easily solved by > making use of NOOP scheduler by these devices. What are the reasons for > these devices to not use even noop? > >> > But do we want strict fairness numbers on higher level logical devices >> > even if it means sub-optimal usage of underlying physical devices?
>> > >> > I think that for proportional bandwidth control, it should be ok to provide >> > fairness at leaf nodes but for max bandwidth control it >> > might make more sense to provide fairness at higher level. Consider a >> > case where from a striped device a customer wants to limit a group to >> > 30MB/s and in case of leaf node control, if every leaf node provides >> > 30MB/s, it might accumulate to much more than specified rate at logical >> > device. >> > >> > Latency Control and strong isolation between groups >> > =================================================== >> > Do we want a good isolation between groups and better latencies and >> > stronger isolation between groups? >> > >> > I think if problem is solved at IO scheduler level, we can achieve better >> > latency control and hence stronger isolation between groups. >> > >> > Higher level solutions should find it hard to provide same kind of latency >> > control and isolation between groups as IO scheduler based solution. >> >> Why do you think that the higher level solution is hard to provide it? >> I think that it is a matter of how to implement throttling policy. >> > > So far both in dm-ioband and IO throttling solution I have seen that > higher layer implements some kind of leaky bucket/token bucket algorithm, > which inherently allows IO from all the competing groups until they run > out of tokens and then these groups are made to wait till fresh tokens are > issued. > > That means, most of the times, IO scheduler will see requests from more > than one group at the same time and that will be the source of weak > isolation between groups. > > Consider following simple examples. Assume there are two groups and one > contains 16 random readers and other contains 1 random reader. > > G1 G2 > 16RR 1RR > > Now it might happen that IO scheduler sees requests from all the 17 RR > readers at the same time.
(Throttling probably will kick in later because > you would like to give one group a nice slice of 100ms otherwise > sequential readers will suffer a lot and disk will become seek bound). > > So CFQ will dispatch requests (at least one), from each of the 16 random > readers first and then from 1 random reader in group 2 and this increases > the max latency for the application in group 2 and provides weak > isolation. > > There will also be additional issues with CFQ preemption logic. CFQ will > have no knowledge of groups and it will do cross group preemptions. For > example if a meta data request comes in group1, it will preempt any of > the queue being served in other groups. So somebody doing "find . *" or > "cat <small files>" in one group will keep on preempting a sequential > reader in other group. Again this will probably lead to higher max > latencies. > > Note, even if CFQ does not enable idling on random readers, and expires > queue after single dispatch, seeking time between queues can be > significant. Similarly, if instead of 16 random readers we had 16 random > synchronous writers we will have seek time issue as well as writers can > often dump bigger requests which also adds to latency. > > This latency issue can be solved if we dispatch requests only from one > group for a certain period of time and then move to next group. (Something > like what common layer is doing). > > If we go for only single group dispatching requests, then we shall have > to implement some of the preemption semantics also in higher layer because > in certain cases we want to do preemption across the groups. Like RT task > group preempting non-RT task group etc. > > Once we go deeper into implementation, I think we will find more issues.
> >> > Fairness for buffered writes >> > ============================ >> > Doing io control at any place below page cache has disadvantage that page >> > cache might not dispatch more writes from higher weight group hence higher >> > weight group might not see more IO done. Andrew says that we don't have >> > a solution to this problem in kernel and he would like to see it handled >> > properly. >> > >> > Only way to solve this seems to be to slow down the writers before they >> > write into page cache. IO throttling patch handled it by slowing down >> > writer if it crossed max specified rate. Other suggestions have come in >> > the form of dirty_ratio per memory cgroup or a separate cgroup controller >> > altogether where some kind of per group write limit can be specified. >> > >> > So if solution is implemented at IO scheduler layer or at device mapper >> > layer, both shall have to rely on another controller to be co-mounted >> > to handle buffered writes properly. >> > >> > Fairness with-in group >> > ====================== >> > One of the issues with higher level controller is that how to do fair >> > throttling so that fairness with-in group is not impacted. Especially >> > the case of making sure that we don't break the notion of ioprio of the >> > processes with-in group. >> >> I ran your test script to confirm that the notion of ioprio was not >> broken by dm-ioband. Here are the results of the test. >> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html >> >> I think that the time period during which dm-ioband holds IO requests >> for throttling would be too short to break the notion of ioprio. > > Ok, I re-ran that test. Previously default io_limit value was 192 and now > I set it up to 256 as you suggested. I still see writer starving reader. I > have removed "conv=fdatasync" from writer so that a writer is pure buffered > writes.
> > With vanilla CFQ > ---------------- > reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s > > with dm-ioband default io_limit=192 > ----------------------------------- > writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s > reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s > > ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100 > ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100 > > with dm-ioband default io_limit=256 > ----------------------------------- > reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s > > ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100 > ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100 > > Notice that with vanilla CFQ, reader is taking 10 seconds to finish and > with dm-ioband it takes more than 40 seconds to finish. So writer is still > starving the reader with both io_limit 192 and 256. > > On top of that can you please give some details how increasing the > buffered queue length reduces the impact of writers? > > IO Prio issue > -------------- > I ran another test where two ioband devices were created of weight 100 > each on two partitions. In first group 4 readers were launched. Three > readers are of class BE and prio 7, fourth one is of class BE prio 0. In > group2, I launched a buffered writer. > > One would expect that prio0 reader gets more bandwidth as compared to > prio 4 readers and prio 7 readers will get more or less same bw. Looks like > that is not happening. Look how vanilla CFQ provides much more bandwidth > to prio0 reader as compared to prio7 reader and how putting them in the > group reduces the difference between prio0 and prio7 readers. > > Following are the results.
> > Vanilla CFQ > =========== > set1 > ---- > prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s > 578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s > 578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s > 578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s > > set2 > ---- > prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s > 578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s > 578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s > 578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s > > set3 > ---- > prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s > 578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s > 578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s > 578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s > > with dm-ioband > ============== > ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100 > ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100 > > set1 > ---- > prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s > 578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s > 578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s > 578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s > > set2 > --- > prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s > 578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s > 578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s > 578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s > > set3 > ---- > prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s > 578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s > 578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s > 578867200 bytes (579 
MB) copied, 139.986 s, 4.1 MB/s > writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s > > Note: In vanilla CFQ, prio0 reader got more than 350% BW of prio 7 reader. > With dm-ioband this ratio changed to less than 200%. > > I will run more tests, but this shows how notion of priority with-in a > group changes if we implement throttling at higher layer and don't > keep it with CFQ. > > The second thing which strikes me is that I divided the disk 50% each > between readers and writers and in that case would expect protection > for writers and expect writers to finish fast. But writers have been > slowed down, and it also kills overall disk throughput. I think > it probably became seek bound. > > I think the moment I get more time, I will run some timed fio tests > and look at how overall disk performed and how bandwidth was > distributed with-in group and between groups. > >> >> > Especially io throttling patch was very bad in terms of prio with-in >> > group where throttling treated everyone equally and difference between >> > process prio disappeared. >> > >> > Reads Vs Writes >> > =============== >> > A higher level control most likely will change the ratio in which reads >> > and writes are dispatched to disk with-in group. It used to be decided >> > by IO scheduler so far but with higher level groups doing throttling and >> > possibly buffering the bios and releasing them later, they will have to >> > come up with their own policy on what proportion reads and writes >> > should be dispatched in. In case of IO scheduler based control, all the >> > queuing takes place at IO scheduler and it still retains control of >> > in what ratio reads and writes should be dispatched. >> >> I don't think it is a concern. The current implementation of dm-ioband >> is that sync/async IO requests are handled separately and the >> backlogged IOs are released according to the order of arrival if both >> sync and async requests are backlogged.
> > At least the version of dm-ioband I have is not producing the desired > results. See above. > > Is there a newer version? I will run some tests on that too. But I think > you will again run into same issue where you will decide the ratio of > read vs write with-in group and as I change the IO schedulers results > will vary. > > So at this point of time I can't think how you can solve read vs write > ratio issue at higher layer without changing the behavior of underlying > IO scheduler. > >> >> > Summary >> > ======= >> > >> > - An io scheduler based io controller can provide better latencies, >> > stronger isolation between groups, time based fairness and will not >> > interfere with io schedulers policies like class, ioprio and >> > reader vs writer issues. >> > >> > But it cannot guarantee fairness at higher logical level devices. >> > Especially in case of max bw control, leaf node control does not sound >> > to be the most appropriate thing. >> > >> > - IO throttling provides max bw control in terms of absolute rate. It has >> > the advantage that it can provide control at higher level logical device >> > and also control buffered writes without need of additional controller >> > co-mounted. >> > >> > But it does only max bw control and not proportion control so one might >> > not be using resources optimally. It loses sense of task prio and class >> > with-in group as any of the task can be throttled with-in group. Because >> > throttling does not kick in till you hit the max bw limit, it should find >> > it hard to provide same latencies as io scheduler based control. >> > >> > - dm-ioband also has the advantage that it can provide fairness at higher >> > level logical devices. >> > >> > But, fairness is provided only in terms of size of IO or number of IO. >> > No time based fairness. It is very throughput oriented and does not >> > throttle high speed group if other group is running slow random reader.
>> > This results in bad latencies for random reader group and weaker >> > isolation between groups. >> >> A new policy can be added to dm-ioband. Actually, range-bw policy, >> which provides min and max bandwidth control, does time-based >> throttling. Moreover there is room for improvement for existing >> policies. The write-starve-read issue you pointed out will be solved >> soon. >> >> > Also it does not provide fairness if a group is not continuously >> > backlogged. So if one is running 1-2 dd/sequential readers in the group, >> > one does not get fairness until workload is increased to a point where >> > group becomes continuously backlogged. This also results in poor >> > latencies and limited fairness. >> >> This is intended to efficiently use bandwidth of underlying devices >> when IO load is low. > > But this has following undesired results. > > - Slow moving group does not get reduced latencies. For example, random readers > in slow moving group get no isolation and will continue to see higher max > latencies. > > - A single sequential reader in one group does not get fair share and > we might be pushing buffered writes in other group thinking that we > are getting better throughput. But the fact is that we are eating away > readers share in group1 and giving it to writers in group2. Also I > showed that we did not necessarily improve the overall throughput of > the system by doing so. (Because it increases the number of seeks). > > I had sent you a mail to show that. > > http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html > > But you changed the test case to run 4 readers in a single group to show that > the throughput does not decrease. Please don't change test cases. In case of 4 > sequential readers in the group, group is continuously backlogged and you > don't steal bandwidth from slow moving group.
So in that mail I was not > even discussing the scenario when you don't steal the bandwidth from > other group. > > I specifically created one slow moving group with one reader so that we end up > stealing bandwidth from slow moving group and show that we did not achieve > higher overall throughput by stealing the BW; at the same time we did not get > fairness for single reader and observed decreasing throughput for single > reader as number of writers in other group increased. > > Thanks > Vivek > >> >> > At this point of time it does not look like a single IO controller covers all >> > the scenarios/requirements. This means few things to me. >> > >> > - Drop some of the requirements and go with one implementation which meets >> > those reduced set of requirements. >> > >> > - Have more than one IO controller implementation in kernel. One for lower >> > level control for better latencies, stronger isolation and optimal resource >> > usage and other one for fairness at higher level logical devices and max >> > bandwidth control. >> > >> > And let user decide which one to use based on his/her needs. >> > >> > - Come up with more intelligent way of doing IO control where single >> > controller covers all the cases. >> > >> > At this point of time, I am more inclined towards option 2 of having more >> > than one implementation in kernel. :-) (Until and unless we can brainstorm >> > and come up with ideas to make option 3 happen). >> > >> > > It would be great if we discuss our plans on the mailing list, so we >> > > can get early feedback from everyone. >> > >> > This is what comes to my mind so far. Please add to the list if I have missed >> > some points. Also correct me if I am wrong about the pros/cons of the >> > approaches. >> > >> > Thoughts/ideas/opinions are welcome... >> > >> > Thanks >> > Vivek >> >> Thanks, >> Ryo Tsuruta > ^ permalink raw reply [flat|nested] 349+ messages in thread
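The two-group example quoted above (16 random readers in G1, one random reader in G2) can be made concrete with a small sketch. This is not CFQ code and not the common layer from the patch set; it is a simplified, hypothetical model in Python that only compares dispatch order when queues are served without any knowledge of groups against dispatch order when the dispatcher alternates between groups. The function names and the reader labels are assumptions made for illustration.

```python
from collections import deque


def flat_round_robin(readers):
    """Serve one request per reader in arrival order, ignoring groups."""
    return [name for name, _group in readers]


def group_round_robin(readers):
    """Alternate between groups, serving one reader per backlogged group per turn."""
    by_group = {}
    for name, group in readers:
        by_group.setdefault(group, deque()).append(name)
    order = []
    while any(by_group.values()):
        for group in by_group:
            if by_group[group]:
                order.append(by_group[group].popleft())
    return order


# 16 random readers in G1 followed by a single random reader in G2.
readers = [(f"G1-rr{i}", "G1") for i in range(16)] + [("G2-rr0", "G2")]

flat = flat_round_robin(readers)
grouped = group_round_robin(readers)

# Number of requests dispatched ahead of the lone G2 reader in each model.
print(flat.index("G2-rr0"))     # 16
print(grouped.index("G2-rr0"))  # 1
```

In this toy model the G2 reader waits behind 16 seeks when groups are invisible to the dispatcher and behind one when they are not. It says nothing about slice lengths, idling, or preemption, which are the harder parts discussed in the thread.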
* Re: IO scheduler based IO controller V10 [not found] ` <20090929141049.GA12141-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-09-29 19:53 ` Nauman Rafique @ 2009-09-30 8:43 ` Ryo Tsuruta 1 sibling, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-09-30 8:43 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Hi Vivek, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > I was thinking that elevator layer will do the merge of bios. So IO > scheduler/elevator can time stamp the first bio in the request as it goes > into the disk and again timestamp with finish time once request finishes. > > This way higher layer can get an idea how much disk time a group of bios > used. But on multi queue, if we dispatch say 4 requests from same queue, > then time accounting becomes an issue. > > Consider following where four requests rq1, rq2, rq3 and rq4 are > dispatched to disk at time t0, t1, t2 and t3 respectively and these > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume > time elapsed between each of milestones is t. Also assume that all these > requests are from same queue/group. > > t0 t1 t2 t3 t4 t5 t6 t7 > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4 > > Now higher layer will think that time consumed by group is: > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t > > But the time elapsed is only 7t. 
IO controller can know how many requests are issued and still in progress. Is it not enough to accumulate the time while in-flight IOs exist? > Secondly if a different group is running only single sequential reader, > there CFQ will be driving queue depth of 1 and time will not be running > faster and this inaccuracy in accounting will lead to unfair share between > groups. > > So we need something better to get a sense which group used how much of > disk time. It could be solved by implementing a way to pass on such information from IO scheduler to higher layer controller. > > How about making throttling policy be user selectable like the IO > > scheduler and putting it in the higher layer? So we could support > > all of policies (time-based, size-based and rate limiting). There > > seems not to be only one solution which satisfies all users. But I agree > > with starting with proportional bandwidth control first. > > > > What are the cases where time based policy does not work and size based > policy works better and user would choose size based policy and not time > based one? I think that disk time is not simply proportional to IO size. If there are two groups whose weights are equally assigned and they issue different sized IOs respectively, the bandwidth of each group would not be distributed equally as expected. > I am not against implementing things in higher layer as long as we can > ensure tight control on latencies, strong isolation between groups and > not break CFQ's class and ioprio model with-in group. > > > BTW, I will start to reimplement dm-ioband into block layer. > > Can you elaborate little bit on this? bio is grabbed in generic_make_request() and throttled in the same way as dm-ioband's mechanism. dmsetup command is not necessary any longer.
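The remark above that disk time is not simply proportional to IO size can be illustrated with a rough service-time model. The sketch below is hypothetical: it assumes each IO costs a fixed positioning time plus a transfer time, and the 8 ms positioning time and 50 MB/s transfer rate are made-up round numbers, not measurements from this thread.

```python
def throughput_mb_per_s(io_size_kb, seek_ms=8.0, transfer_mb_per_s=50.0):
    """Throughput a group achieves while it owns the disk, for a given IO size.

    Assumed cost model per IO: seek_ms of positioning plus size / transfer rate.
    """
    size_mb = io_size_kb / 1024.0
    per_io_seconds = seek_ms / 1000.0 + size_mb / transfer_mb_per_s
    return size_mb / per_io_seconds


# Two groups with equal weights, hence equal disk time under a time-based policy.
small = throughput_mb_per_s(4)     # group issuing 4 KB random IOs
large = throughput_mb_per_s(512)   # group issuing 512 KB random IOs

print(f"4 KB IOs:   {small:.2f} MB/s")
print(f"512 KB IOs: {large:.2f} MB/s")
```

Under these assumptions the two groups receive the same disk time but very different bandwidth, which is the trade-off between time-based and size-based fairness being debated here.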
> > > Fairness for higher level logical devices > > > ========================================= > > > Do we want good fairness numbers for higher level logical devices also > > > or it is sufficient to provide fairness at leaf nodes. Providing fairness > > > at leaf nodes can help us use the resources optimally and in the process > > > we can get fairness at higher level also in many of the cases. > > > > We should also take care of block devices which provide their own > > make_request_fn() and not use an IO scheduler. We can't use the leaf > > nodes approach to such devices. > > > > I am not sure how big an issue is this. This can be easily solved by > making use of NOOP scheduler by these devices. What are the reasons for > these devices to not use even noop? I'm not sure why the developers of the device driver chose their own way, and the driver is provided in binary form, so we can't modify it. > > > Fairness with-in group > > > ====================== > > > One of the issues with higher level controller is that how to do fair > > > throttling so that fairness with-in group is not impacted. Especially > > > the case of making sure that we don't break the notion of ioprio of the > > > processes with-in group. > > > > I ran your test script to confirm that the notion of ioprio was not > > broken by dm-ioband. Here are the results of the test. > > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html > > > > I think that the time period during which dm-ioband holds IO requests > > for throttling would be too short to break the notion of ioprio. > > Ok, I re-ran that test. Previously default io_limit value was 192 and now The default value of io_limit on the previous test was 128 (not 192) which is equal to the default value of nr_requests. > I set it up to 256 as you suggested. I still see writer starving reader. I > have removed "conv=fdatasync" from writer so that a writer is pure buffered > writes. O.K.
You removed "conv=fdatasync"; the new dm-ioband handles sync/async requests separately, and it solves this buffered-write-starves-read problem. I would like to post it soon after doing some more tests. > On top of that can you please give some details how increasing the > buffered queue length reduces the impact of writers? When the number of in-flight IOs exceeds io_limit, processes which are going to issue IOs are made to sleep by dm-ioband until all the in-flight IOs are finished. But the IO scheduler layer can accept more IO requests than the value of io_limit, so io_limit was a throughput bottleneck. > IO Prio issue > -------------- > I ran another test where two ioband devices were created of weight 100 > each on two partitions. In first group 4 readers were launched. Three > readers are of class BE and prio 7, fourth one is of class BE prio 0. In > group2, I launched a buffered writer. > > One would expect that prio0 reader gets more bandwidth as compared to > prio 4 readers and prio 7 readers will get more or less same bw. Looks like > that is not happening. Look how vanilla CFQ provides much more bandwidth > to prio0 reader as compared to prio7 reader and how putting them in the > group reduces the difference between prio0 and prio7 readers. > > Following are the results. O.K. I'll try to do more tests with dm-ioband according to your comments, especially working with CFQ. Thanks for pointing this out. Thanks, Ryo Tsuruta ^ permalink raw reply [flat|nested] 349+ messages in thread
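The accounting question raised in this message (16t versus 7t for four overlapping requests) can be worked through in a few lines. The following is a minimal Python sketch, not kernel code: it contrasts summing each request's dispatch-to-completion time with accumulating wall-clock time only while at least one request is in flight, which is one reading of the suggestion above. The function names are hypothetical, and times are in units of t.

```python
def summed_request_time(requests):
    """Naive accounting: add up (finish - dispatch) for every request."""
    return sum(finish - dispatch for dispatch, finish in requests)


def busy_time(requests):
    """Accumulate time only while at least one request is in flight."""
    total = 0
    start = end = None
    for dispatch, finish in sorted(requests):
        if end is None or dispatch > end:
            # Nothing was in flight between end and dispatch: close the interval.
            if end is not None:
                total += end - start
            start, end = dispatch, finish
        else:
            # Overlaps the current busy interval: extend it.
            end = max(end, finish)
    if end is not None:
        total += end - start
    return total


# rq1..rq4 dispatched at t0..t3 and completed at t4..t7.
requests = [(0, 4), (1, 5), (2, 6), (3, 7)]

print(summed_request_time(requests))  # 16
print(busy_time(requests))            # 7
```

This only addresses the double counting within one group. It does not by itself resolve the second concern in the quoted text, where groups driving different queue depths are charged differently for the same elapsed time.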
* Re: IO scheduler based IO controller V10 2009-09-29 14:10 ` Vivek Goyal ` (2 preceding siblings ...) (?) @ 2009-09-30 8:43 ` Ryo Tsuruta 2009-09-30 11:05 ` Vivek Goyal [not found] ` <20090930.174319.183036386.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> -1 siblings, 2 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-09-30 8:43 UTC (permalink / raw) To: vgoyal Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya Hi Vivek, Vivek Goyal <vgoyal@redhat.com> wrote: > I was thinking that elevator layer will do the merge of bios. So IO > scheduler/elevator can time stamp the first bio in the request as it goes > into the disk and again timestamp with finish time once request finishes. > > This way higher layer can get an idea how much disk time a group of bios > used. But on multi queue, if we dispatch say 4 requests from same queue, > then time accounting becomes an issue. > > Consider following where four requests rq1, rq2, rq3 and rq4 are > dispatched to disk at time t0, t1, t2 and t3 respectively and these > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume > time elapsed between each of milestones is t. Also assume that all these > requests are from same queue/group. > > t0 t1 t2 t3 t4 t5 t6 t7 > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4 > > Now higher layer will think that time consumed by group is: > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t > > But the time elapsed is only 7t. IO controller can know how many requests are issued and still in progress. Is it not enough to accumulate the time while in-flight IOs exist? 
> Secondly if a different group is running only single sequential reader,
> there CFQ will be driving queue depth of 1 and time will not be running
> faster and this inaccuracy in accounting will lead to unfair share between
> groups.
>
> So we need something better to get a sense which group used how much of
> disk time.

It could be solved by implementing a way to pass on such information
from the IO scheduler to the higher layer controller.

> > How about making throttling policy be user selectable like the IO
> > scheduler and putting it in the higher layer? So we could support
> > all of policies (time-based, size-based and rate limiting). There
> > seems not to be only one solution which satisfies all users. But I agree
> > with starting with proportional bandwidth control first.
> >
>
> What are the cases where time based policy does not work and size based
> policy works better and user would choose size based policy and not timed
> based one?

I think that disk time is not simply proportional to IO size. If there
are two groups whose weights are equally assigned and they issue
different sized IOs respectively, the bandwidth of each group would
not be distributed equally as expected.

> I am not against implementing things in higher layer as long as we can
> ensure tight control on latencies, strong isolation between groups and
> not break CFQ's class and ioprio model with-in group.
>
> > BTW, I will start to reimplement dm-ioband into block layer.
>
> Can you elaborate little bit on this?

bio is grabbed in generic_make_request() and throttled as well as
dm-ioband's mechanism. dmsetup command is not necessary any longer.

> > > Fairness for higher level logical devices
> > > =========================================
> > > Do we want good fairness numbers for higher level logical devices also
> > > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > > at leaf nodes can help us use the resources optimally and in the process
> > > we can get fairness at higher level also in many of the cases.
> >
> > We should also take care of block devices which provide their own
> > make_request_fn() and not use an IO scheduler. We can't use the leaf
> > nodes approach to such devices.
> >
>
> I am not sure how big an issue this is. This can be easily solved by
> making use of NOOP scheduler by these devices. What are the reasons for
> these devices to not use even noop?

I'm not sure why the developers of the device driver choose their own
way, and the driver is provided in binary form, so we can't modify it.

> > > Fairness with-in group
> > > ======================
> > > One of the issues with higher level controller is that how to do fair
> > > throttling so that fairness with-in group is not impacted. Especially
> > > the case of making sure that we don't break the notion of ioprio of the
> > > processes with-in group.
> >
> > I ran your test script to confirm that the notion of ioprio was not
> > broken by dm-ioband. Here are the results of the test.
> > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> >
> > I think that the time period during which dm-ioband holds IO requests
> > for throttling would be too short to break the notion of ioprio.
>
> Ok, I re-ran that test. Previously default io_limit value was 192 and now

The default value of io_limit on the previous test was 128 (not 192)
which is equal to the default value of nr_requests.

> I set it up to 256 as you suggested. I still see writer starving reader. I
> have removed "conv=fdatasync" from writer so that a writer is pure buffered
> writes.

O.K. You removed "conv=fdatasync". The new dm-ioband handles sync/async
requests separately, and it solves this buffered-write-starves-read
problem. I would like to post it soon after doing some more tests.
> On top of that can you please give some details how increasing the
> buffered queue length reduces the impact of writers?

When the number of in-flight IOs exceeds io_limit, processes which are
going to issue IOs are made to sleep by dm-ioband until all the in-flight
IOs are finished. But IO scheduler layer can accept IO requests more
than the value of io_limit, so it was a bottleneck of the throughput.

> IO Prio issue
> --------------
> I ran another test where two ioband devices were created of weight 100
> each on two partitions. In first group 4 readers were launched. Three
> readers are of class BE and prio 7, fourth one is of class BE prio 0. In
> group2, I launched a buffered writer.
>
> One would expect that prio0 reader gets more bandwidth as compared to
> prio 4 readers and prio 7 readers will get more or less same bw. Looks like
> that is not happening. Look how vanilla CFQ provides much more bandwidth
> to prio0 reader as compared to prio7 reader and how putting them in the
> group reduces the difference between prio0 and prio7 readers.
>
> Following are the results.

O.K. I'll try to do more tests with dm-ioband according to your
comments, especially working with CFQ. Thanks for pointing that out.

Thanks,
Ryo Tsuruta

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-09-30 8:43 ` Ryo Tsuruta
@ 2009-09-30 11:05 ` Vivek Goyal
  [not found] ` <20090930.174319.183036386.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>
  1 sibling, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-30 11:05 UTC (permalink / raw)
  To: Ryo Tsuruta
  Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah,
      lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka,
      guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
      akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > I was thinking that elevator layer will do the merge of bios. So IO
> > scheduler/elevator can time stamp the first bio in the request as it goes
> > into the disk and again timestamp with finish time once request finishes.
> >
> > This way higher layer can get an idea how much disk time a group of bios
> > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > then time accounting becomes an issue.
> >
> > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > requests finish at time t4, t5, t6 and t7. For sake of simplicity assume
> > time elapsed between each of milestones is t. Also assume that all these
> > requests are from same queue/group.
> >
> > t0   t1   t2   t3   t4   t5   t6   t7
> > rq1  rq2  rq3  rq4  rq1  rq2  rq3  rq4
> >
> > Now higher layer will think that time consumed by group is:
> >
> > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> >
> > But the time elapsed is only 7t.
>
> IO controller can know how many requests are issued and still in
> progress. Is it not enough to accumulate the time while in-flight IOs
> exist?
>

That time would not reflect disk time used. It will be the following.
(time spent waiting in CFQ queues) + (time spent in dispatch queue) +
(time spent in disk)

> > Secondly if a different group is running only single sequential reader,
> > there CFQ will be driving queue depth of 1 and time will not be running
> > faster and this inaccuracy in accounting will lead to unfair share between
> > groups.
> >
> > So we need something better to get a sense which group used how much of
> > disk time.
>
> It could be solved by implementing the way to pass on such information
> from IO scheduler to higher layer controller.
>

How would you do that? Can you give some details exactly how and what
information IO scheduler will pass to higher level IO controller so that IO
controller can attribute right time to the group?

> > > How about making throttling policy be user selectable like the IO
> > > scheduler and putting it in the higher layer? So we could support
> > > all of policies (time-based, size-based and rate limiting). There
> > > seems not to be only one solution which satisfies all users. But I agree
> > > with starting with proportional bandwidth control first.
> > >
> >
> > What are the cases where time based policy does not work and size based
> > policy works better and user would choose size based policy and not timed
> > based one?
>
> I think that disk time is not simply proportional to IO size. If there
> are two groups whose weights are equally assigned and they issue
> different sized IOs respectively, the bandwidth of each group would
> not be distributed equally as expected.
>

If we are providing fairness in terms of time, it is fair. If we provide
equal time slots to two processes and if one got more IO done because it
was not wasting time seeking or it issued bigger size IO, it deserves that
higher BW. IO controller will make sure that process gets fair share in
terms of time and exactly how much BW one got will depend on the workload.

That's the precise reason that fairness in terms of time is better on
seeky media.
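The argument above can be illustrated with a small Python sketch. All figures are invented for illustration (a 100 ms slice, a sequential workload moving 512 KiB per IO in 5 ms, a seeky workload moving 4 KiB per IO in 10 ms), and the function names are hypothetical. Nothing here is measured data or kernel code.

```python
KIB = 1024

# name -> (bytes per IO, milliseconds per IO); invented figures, not measurements
WORKLOADS = {
    "sequential": (512 * KIB, 5),
    "seeky": (4 * KIB, 10),
}


def bytes_per_time_slice(slice_ms, workloads):
    """Time-based fairness: every group gets the same amount of disk time."""
    return {
        name: (slice_ms // ms_per_io) * bytes_per_io
        for name, (bytes_per_io, ms_per_io) in workloads.items()
    }


def time_per_byte_budget(budget_bytes, workloads):
    """Size-based fairness: every group gets the same byte budget."""
    return {
        name: (budget_bytes // bytes_per_io) * ms_per_io
        for name, (bytes_per_io, ms_per_io) in workloads.items()
    }


moved = bytes_per_time_slice(100, WORKLOADS)
spent = time_per_byte_budget(1024 * KIB, WORKLOADS)

# KiB moved by each group in one 100 ms slice
print({name: n // KIB for name, n in moved.items()})
# Milliseconds of disk time each group needs to move 1 MiB
print(spent)
```

With equal time, the sequential group moves far more data. With equal bytes, the seeky group occupies the disk far longer. Which of the two counts as fair is what is being debated here.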
> > I am not against implementing things in higher layer as long as we can
> > ensure tight control on latencies, strong isolation between groups and
> > not break CFQ's class and ioprio model with-in group.
> >
> > > BTW, I will start to reimplement dm-ioband into block layer.
> >
> > Can you elaborate little bit on this?
>
> bio is grabbed in generic_make_request() and throttled as well as
> dm-ioband's mechanism. dmsetup command is not necessary any longer.
>

Ok, so one would not need dm-ioband device now, but same dm-ioband
throttling policies will apply. So until and unless we figure out a
better way, the issues I have pointed out will still exist even in
new implementation.

> > > > Fairness for higher level logical devices
> > > > =========================================
> > > > Do we want good fairness numbers for higher level logical devices also
> > > > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > > > at leaf nodes can help us use the resources optimally and in the process
> > > > we can get fairness at higher level also in many of the cases.
> > >
> > > We should also take care of block devices which provide their own
> > > make_request_fn() and not use an IO scheduler. We can't use the leaf
> > > nodes approach to such devices.
> > >
> >
> > I am not sure how big an issue this is. This can be easily solved by
> > making use of NOOP scheduler by these devices. What are the reasons for
> > these devices to not use even noop?
>
> I'm not sure why the developers of the device driver choose their own
> way, and the driver is provided in binary form, so we can't modify it.
>
> > > > Fairness with-in group
> > > > ======================
> > > > One of the issues with higher level controller is that how to do fair
> > > > throttling so that fairness with-in group is not impacted. Especially
> > > > the case of making sure that we don't break the notion of ioprio of the
> > > > processes with-in group.
> > >
> > > I ran your test script to confirm that the notion of ioprio was not
> > > broken by dm-ioband. Here are the results of the test.
> > > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
> > >
> > > I think that the time period during which dm-ioband holds IO requests
> > > for throttling would be too short to break the notion of ioprio.
> >
> > Ok, I re-ran that test. Previously default io_limit value was 192 and now
>
> The default value of io_limit on the previous test was 128 (not 192)
> which is equal to the default value of nr_requests.

Hm..., I used following commands to create two ioband devices.

echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none" \
     "weight 0 :100" | dmsetup create ioband1
echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none" \
     "weight 0 :100" | dmsetup create ioband2

Here io_limit value is zero so it should pick default value. Following is
output of "dmsetup table" command.

ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
                                    ^^^
IIUC, above number 192 is reflecting io_limit? If yes, then default seems
to be 192?

>
> > I set it up to 256 as you suggested. I still see writer starving reader. I
> > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > writes.
>
> O.K. You removed "conv=fdatasync". The new dm-ioband handles
> sync/async requests separately, and it solves this
> buffered-write-starves-read problem. I would like to post it soon
> after doing some more tests.
>
> > On top of that can you please give some details how increasing the
> > buffered queue length reduces the impact of writers?
>
> When the number of in-flight IOs exceeds io_limit, processes which are
> going to issue IOs are made to sleep by dm-ioband until all the in-flight
> IOs are finished. But IO scheduler layer can accept IO requests more
> than the value of io_limit, so it was a bottleneck of the throughput.
>

Ok, so it should have been throughput bottleneck but how did it solve the
issue of writer starving the reader as you had mentioned in the mail?

Secondly, you mentioned that processes are made to sleep once we cross
io_limit. This sounds like request descriptor facility on request queue
where processes are made to sleep.

There are threads in kernel which don't want to sleep while submitting
bios. For example, btrfs has bio submitting thread which does not want
to sleep hence it checks with device if it is congested or not and does not
submit the bio if it is congested. How would you handle such cases? Have
you implemented any per group congestion kind of interface to make sure
such IO's don't sleep if group is congested?

Or this limit is per ioband device which every group on the device is
sharing. If yes, then how would you provide isolation between groups
because if one group consumes io_limit tokens, then other will simply
be serialized on that device?

> > IO Prio issue
> > --------------
> > I ran another test where two ioband devices were created of weight 100
> > each on two partitions. In first group 4 readers were launched. Three
> > readers are of class BE and prio 7, fourth one is of class BE prio 0. In
> > group2, I launched a buffered writer.
> >
> > One would expect that prio0 reader gets more bandwidth as compared to
> > prio 4 readers and prio 7 readers will get more or less same bw. Looks like
> > that is not happening. Look how vanilla CFQ provides much more bandwidth
> > to prio0 reader as compared to prio7 reader and how putting them in the
> > group reduces the difference between prio0 and prio7 readers.
> >
> > Following are the results.
>
> O.K. I'll try to do more tests with dm-ioband according to your
> comments, especially working with CFQ. Thanks for pointing that out.
>
> Thanks,
> Ryo Tsuruta

^ permalink raw reply [flat|nested] 349+ messages in thread
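The question above about submitters that must not sleep can be pictured with a small Python sketch. This is purely hypothetical: the class GroupLimiter and its methods is_congested(), try_submit() and complete() are invented for illustration and are not dm-ioband or block-layer interfaces. It shows a per-group in-flight limit that a submitter can query instead of blocking, so that one group reaching its limit does not stall another group.

```python
class GroupLimiter:
    """Hypothetical per-group in-flight limit with a non-blocking congestion check."""

    def __init__(self, limits):
        # limits: group name -> maximum number of in-flight IOs for that group
        self.limits = dict(limits)
        self.in_flight = {group: 0 for group in self.limits}

    def is_congested(self, group):
        """True when the group has reached its own limit."""
        return self.in_flight[group] >= self.limits[group]

    def try_submit(self, group):
        """Account one IO if the group has room; never blocks the caller."""
        if self.is_congested(group):
            return False
        self.in_flight[group] += 1
        return True

    def complete(self, group):
        """Account the completion of one in-flight IO."""
        if self.in_flight[group] == 0:
            raise ValueError("no in-flight IO for group %r" % group)
        self.in_flight[group] -= 1


limiter = GroupLimiter({"writers": 2, "readers": 2})
print(limiter.try_submit("writers"))  # True
print(limiter.try_submit("writers"))  # True
print(limiter.try_submit("writers"))  # False: writers are at their limit
print(limiter.try_submit("readers"))  # True: readers are not affected
```

A caller that cannot sleep would check is_congested() or the return value of try_submit() and defer the IO. A single shared counter per device would instead make the last call return False, which is the serialization concern raised above.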
* Re: IO scheduler based IO controller V10
  2009-09-30 11:05 ` Vivek Goyal
@ 2009-10-01 6:41 ` Ryo Tsuruta
  -1 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-01 6:41 UTC (permalink / raw)
  To: vgoyal
  Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah,
      lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka,
      guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk,
      akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > Hi Vivek,
> >
> > Vivek Goyal <vgoyal@redhat.com> wrote:
> > > I was thinking that elevator layer will do the merge of bios. So IO
> > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > into the disk and again timestamp with finish time once request finishes.
> > >
> > > This way higher layer can get an idea how much disk time a group of bios
> > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > then time accounting becomes an issue.
> > >
> > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > requests finish at time t4, t5, t6 and t7. For sake of simplicity assume
> > > time elapsed between each of milestones is t. Also assume that all these
> > > requests are from same queue/group.
> > >
> > > t0   t1   t2   t3   t4   t5   t6   t7
> > > rq1  rq2  rq3  rq4  rq1  rq2  rq3  rq4
> > >
> > > Now higher layer will think that time consumed by group is:
> > >
> > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > >
> > > But the time elapsed is only 7t.
> >
> > IO controller can know how many requests are issued and still in
> > progress. Is it not enough to accumulate the time while in-flight IOs
> > exist?
> >
>
> That time would not reflect disk time used. It will be the following.
>
> (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> (time spent in disk)

In the case where multiple IO requests are issued from the IO controller,
that time measurement is the time from when the first IO request is
issued until the endio is called for the last IO request. Does it not
reflect disk time?

> > > Secondly if a different group is running only single sequential reader,
> > > there CFQ will be driving queue depth of 1 and time will not be running
> > > faster and this inaccuracy in accounting will lead to unfair share between
> > > groups.
> > >
> > > So we need something better to get a sense which group used how much of
> > > disk time.
> >
> > It could be solved by implementing the way to pass on such information
> > from IO scheduler to higher layer controller.
> >
>
> How would you do that? Can you give some details exactly how and what
> information IO scheduler will pass to higher level IO controller so that IO
> controller can attribute right time to the group?

If you would like to know when the idle timer has expired, how about
adding a function to the IO controller so that it is notified by the IO
scheduler? The IO scheduler calls the function when the timer expires.

> > > > How about making throttling policy be user selectable like the IO
> > > > scheduler and putting it in the higher layer? So we could support
> > > > all of policies (time-based, size-based and rate limiting). There
> > > > seems not to be only one solution which satisfies all users. But I agree
> > > > with starting with proportional bandwidth control first.
> > > >
> > >
> > > What are the cases where time based policy does not work and size based
> > > policy works better and user would choose size based policy and not timed
> > > based one?
> >
> > I think that disk time is not simply proportional to IO size. If there
> > are two groups whose weights are equally assigned and they issue
> > different sized IOs respectively, the bandwidth of each group would
> > not be distributed equally as expected.
> >
>
> If we are providing fairness in terms of time, it is fair. If we provide
> equal time slots to two processes and if one got more IO done because it
> was not wasting time seeking or it issued bigger size IO, it deserves that
> higher BW. IO controller will make sure that process gets fair share in
> terms of time and exactly how much BW one got will depend on the workload.
>
> That's the precise reason that fairness in terms of time is better on
> seeky media.

If the seek time is negligible, the bandwidth would not be distributed
in proportion to the weight settings. I think that it would be hard for
users to understand how bandwidth is distributed. And I also think that
seeky media will gradually become obsolete.

> > > I am not against implementing things in higher layer as long as we can
> > > ensure tight control on latencies, strong isolation between groups and
> > > not break CFQ's class and ioprio model with-in group.
> > >
> > > > BTW, I will start to reimplement dm-ioband into block layer.
> > >
> > > Can you elaborate little bit on this?
> >
> > bio is grabbed in generic_make_request() and throttled as well as
> > dm-ioband's mechanism. dmsetup command is not necessary any longer.
> >
>
> Ok, so one would not need dm-ioband device now, but same dm-ioband
> throttling policies will apply. So until and unless we figure out a
> better way, the issues I have pointed out will still exist even in
> new implementation.

Yes, those still exist, but somehow I would like to try to solve them.

> > The default value of io_limit on the previous test was 128 (not 192)
> > which is equal to the default value of nr_requests.
>
> Hm..., I used following commands to create two ioband devices.
>
> echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none" \
>      "weight 0 :100" | dmsetup create ioband1
> echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none" \
>      "weight 0 :100" | dmsetup create ioband2
>
> Here io_limit value is zero so it should pick default value. Following is
> output of "dmsetup table" command.
>
> ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
>                                     ^^^
> IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> to be 192?

The default value has changed since v1.12.0; it increased from 128 to 192.

> > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > writes.
> >
> > O.K. You removed "conv=fdatasync". The new dm-ioband handles
> > sync/async requests separately, and it solves this
> > buffered-write-starves-read problem. I would like to post it soon
> > after doing some more tests.
> >
> > > On top of that can you please give some details how increasing the
> > > buffered queue length reduces the impact of writers?
> >
> > When the number of in-flight IOs exceeds io_limit, processes which are
> > going to issue IOs are made to sleep by dm-ioband until all the in-flight
> > IOs are finished. But IO scheduler layer can accept IO requests more
> > than the value of io_limit, so it was a bottleneck of the throughput.
> >
>
> Ok, so it should have been throughput bottleneck but how did it solve the
> issue of writer starving the reader as you had mentioned in the mail?

As I wrote above, I modified dm-ioband to handle sync/async requests
separately, so even if writers do a lot of buffered IOs, readers can
issue IOs regardless of the writers' busyness. Once the IOs are backlogged
for throttling, both sync and async requests are issued according
to the order of arrival.

> Secondly, you mentioned that processes are made to sleep once we cross
> io_limit. This sounds like request descriptor facility on request queue
> where processes are made to sleep.
>
> There are threads in kernel which don't want to sleep while submitting
> bios. For example, btrfs has bio submitting thread which does not want
> to sleep hence it checks with device if it is congested or not and not
> submit the bio if it is congested. How would you handle such cases. Have
> you implemented any per group congestion kind of interface to make sure
> such IO's don't sleep if group is congested.
>
> Or this limit is per ioband device which every group on the device is
> sharing. If yes, then how would you provide isolation between groups
> because if one group consumes io_limit tokens, then other will simply
> be serialized on that device?

There are two kinds of limits and both limit the number of IO requests
which can be issued simultaneously, but one is per ioband device,
the other is per ioband group. The per group limit assigned to
each group is calculated by dividing io_limit according to their
proportion of weight.

The kernel thread is not made to sleep by the per group limit, because
several kinds of kernel threads submit IOs from multiple groups and
for multiple devices in a single thread. At this time, the kernel
thread is made to sleep by the per device limit only.

Thanks,
Ryo Tsuruta

^ permalink raw reply [flat|nested] 349+ messages in thread
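
Editorial sketch of the two limits described in the message above. This is not dm-ioband code: the names split_io_limit, must_sleep and struct group_state are hypothetical, the ">=" threshold and the "at least one slot per group" floor are assumptions, and only the general rule (per-group limit proportional to weight, kernel threads checked against the per-device limit only) comes from the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-group bookkeeping; not a dm-ioband structure. */
struct group_state {
    unsigned int inflight;  /* IOs this group currently has in flight */
    unsigned int limit;     /* this group's share of io_limit */
};

/*
 * Split a device-wide io_limit among groups in proportion to weight.
 * Assumption: every group keeps at least one slot so it can make progress.
 */
void split_io_limit(unsigned int io_limit, const unsigned int *weights,
                    size_t ngroups, unsigned int *limits)
{
    unsigned long long total = 0;
    size_t i;

    for (i = 0; i < ngroups; i++)
        total += weights[i];
    assert(total > 0);

    for (i = 0; i < ngroups; i++) {
        limits[i] = (unsigned int)((unsigned long long)io_limit *
                                   weights[i] / total);
        if (limits[i] == 0)
            limits[i] = 1;
    }
}

/*
 * Decide whether a submitter has to wait. The per-device limit applies to
 * everyone; the per-group limit is skipped for kernel threads, which may
 * submit IO on behalf of several groups and devices.
 */
int must_sleep(unsigned int dev_inflight, unsigned int io_limit,
               const struct group_state *g, int is_kthread)
{
    if (dev_inflight >= io_limit)
        return 1;
    if (is_kthread)
        return 0;
    return g->inflight >= g->limit;
}
```

With io_limit = 192 and two groups of weight 100 each, this split gives each group 96 slots; a kernel thread is held back only once the whole device has 192 IOs in flight.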
* Re: IO scheduler based IO controller V10 [not found] ` <20091001.154125.104044685.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> @ 2009-10-01 13:31 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-01 13:31 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote:
> > > Hi Vivek,
> > >
> > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > > > I was thinking that elevator layer will do the merge of bios. So IO
> > > > scheduler/elevator can time stamp the first bio in the request as it goes
> > > > into the disk and again timestamp with finish time once request finishes.
> > > >
> > > > This way higher layer can get an idea how much disk time a group of bios
> > > > used. But on multi queue, if we dispatch say 4 requests from same queue,
> > > > then time accounting becomes an issue.
> > > >
> > > > Consider following where four requests rq1, rq2, rq3 and rq4 are
> > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these
> > > > requests finish at time t4, t5, t6 and t7. For sake of simplicity assume
> > > > time elapsed between each of milestones is t.
Also assume that all these
> > > > requests are from same queue/group.
> > > >
> > > >         t0   t1   t2   t3   t4   t5   t6   t7
> > > >         rq1  rq2  rq3  rq4  rq1  rq2  rq3  rq4
> > > >
> > > > Now higher layer will think that time consumed by group is:
> > > >
> > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > >
> > > > But the time elapsed is only 7t.
> > >
> > > IO controller can know how many requests are issued and still in
> > > progress. Is it not enough to accumulate the time while in-flight IOs
> > > exist?
> > >
> >
> > That time would not reflect disk time used. It will be following.
> >
> > (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> > (time spent in disk)
>
> In the case where multiple IO requests are issued from IO controller,
> that time measurement is the time from when the first IO request is
> issued until when the endio is called for the last IO request. Does
> not it reflect disk time?
>

Not accurately as it will be including the time spent in CFQ queues as
well as dispatch queue. I will not worry much about dispatch queue time
but time spent in CFQ queues can be significant.

This is assuming that you are using token based scheme and will be
dispatching requests from multiple groups at the same time.

But if you figure out a way that you dispatch requests from one group only
at one time and wait for all requests to finish and then let next group
go, then above can work fairly accurately. In that case it will become
like CFQ with the only difference that effectively we have one queue per
group instead of per process.

> > > > Secondly if a different group is running only single sequential reader,
> > > > there CFQ will be driving queue depth of 1 and time will not be running
> > > > faster and this inaccuracy in accounting will lead to unfair share between
> > > > groups.
> > > >
> > > > So we need something better to get a sense which group used how much of
> > > > disk time.
> > >
> > > It could be solved by implementing the way to pass on such information
> > > from IO scheduler to higher layer controller.
> > >
> >
> > How would you do that? Can you give some details exactly how and what
> > information IO scheduler will pass to higher level IO controller so that IO
> > controller can attribute right time to the group.
>
> If you would like to know when the idle timer is expired, how about
> adding a function to IO controller to be notified it from IO
> scheduler? IO scheduler calls the function when the timer is expired.
>

This probably can be done. So this is like syncing between lower layers
and higher layers about when do we start idling and when do we stop it and
both the layers should be in sync.

This is something my common layer approach does. Because it is so close to
IO scheduler, I can do it relatively easily.

One probably can create interfaces to even propagate this information up.
But this all will probably come into the picture only if we don't use
token based schemes and come up with something where at one point of time
dispatch are from one group only.

> > > > > How about making throttling policy be user selectable like the IO
> > > > > scheduler and putting it in the higher layer? So we could support
> > > > > all of policies (time-based, size-based and rate limiting). There
> > > > > seems not to only one solution which satisfies all users. But I agree
> > > > > with starting with proportional bandwidth control first.
> > > > >
> > > >
> > > > What are the cases where time based policy does not work and size based
> > > > policy works better and user would choose size based policy and not timed
> > > > based one?
> > >
> > > I think that disk time is not simply proportional to IO size. If there
> > > are two groups whose weights are equally assigned and they issue
> > > different sized IOs respectively, the bandwidth of each group would
> > > not be distributed equally as expected.
> > >
> >
> > If we are providing fairness in terms of time, it is fair. If we provide
> > equal time slots to two processes and if one got more IO done because it
> > was not wasting time seeking or it issued bigger size IO, it deserves that
> > higher BW. IO controller will make sure that process gets fair share in
> > terms of time and exactly how much BW one got will depend on the workload.
> >
> > That's the precise reason that fairness in terms of time is better on
> > seeky media.
>
> If the seek time is negligible, the bandwidth would not be distributed
> according to a proportion of weight settings. I think that it would be
> unclear for users to understand how bandwidth is distributed. And I
> also think that seeky media would gradually become obsolete,
>

I can understand that the lower the seek cost, the more the game changes,
and a size based policy probably also works decently. In that case at some
point of time CFQ will probably also need to support another mode/policy
where fairness is provided in terms of size of IO, if it detects an SSD
with hardware queuing. Currently it seems to be disabling the idling in
that case. But this is not very good from fairness point of view. I guess
if CFQ wants to provide fairness in such cases, it needs to dynamically
change the shape and start thinking in terms of size of IO.

So far my testing has been very limited to hard disks connected to my
computer. I will do some testing on high end enterprise storage and see
how much seeks matter and how well both the implementations work.

> > > > I am not against implementing things in higher layer as long as we can
> > > > ensure tight control on latencies, strong isolation between groups and
> > > > not break CFQ's class and ioprio model within group.
> > > >
> > > > > BTW, I will start to reimplement dm-ioband into block layer.
> > > >
> > > > Can you elaborate little bit on this?
> > >
> > > bio is grabbed in generic_make_request() and throttled as well as
> > > dm-ioband's mechanism. dmsetup command is not necessary any longer.
> > >
> >
> > Ok, so one would not need dm-ioband device now, but same dm-ioband
> > throttling policies will apply. So until and unless we figure out a
> > better way, the issues I have pointed out will still exist even in
> > new implementation.
>
> Yes, those still exist, but somehow I would like to try to solve them.
>
> > > The default value of io_limit on the previous test was 128 (not 192)
> > > which is equal to the default value of nr_request.
> >
> > Hm..., I used following commands to create two ioband devices.
> >
> > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none"
> > "weight 0 :100" | dmsetup create ioband1
> > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none"
> > "weight 0 :100" | dmsetup create ioband2
> >
> > Here io_limit value is zero so it should pick default value. Following is
> > output of "dmsetup table" command.
> >
> > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100
> > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100
> >                                     ^^^^
> > IIUC, above number 192 is reflecting io_limit? If yes, then default seems
> > to be 192?
>
> The default value has changed since v1.12.0 and increased from 128 to 192.
>
> > > > I set it up to 256 as you suggested. I still see writer starving reader. I
> > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered
> > > > writes.
> > >
> > > O.K. You removed "conv=fdatasync", the new dm-ioband handles
> > > sync/async requests separately, and it solves this
> > > buffered-write-starves-read problem. I would like to post it soon
> > > after doing some more test.
> > >
> > > > On top of that can you please give some details how increasing the
> > > > buffered queue length reduces the impact of writers?
> > >
> > > When the number of in-flight IOs exceeds io_limit, processes which are
> > > going to issue IOs are made sleep by dm-ioband until all the in-flight
> > > IOs are finished. But IO scheduler layer can accept IO requests more
> > > than the value of io_limit, so it was a bottleneck of the throughput.
> > >
> >
> > Ok, so it should have been throughput bottleneck but how did it solve the
> > issue of writer starving the reader as you had mentioned in the mail.
>
> As I wrote above, I modified dm-ioband to handle sync/async requests
> separately, so even if writers do a lot of buffered IOs, readers can
> issue IOs regardless of writers' busyness. Once the IOs are backlogged
> for throttling, both sync and async requests are issued according
> to the order of arrival.
>

Ok, so if both the readers and writers are buffered and some tokens become
available then these tokens will be divided half and half between reader
and writer queues?

> > Secondly, you mentioned that processes are made to sleep once we cross
> > io_limit. This sounds like request descriptor facility on request queue
> > where processes are made to sleep.
> >
> > There are threads in kernel which don't want to sleep while submitting
> > bios. For example, btrfs has bio submitting thread which does not want
> > to sleep hence it checks with device if it is congested or not and not
> > submit the bio if it is congested. How would you handle such cases. Have
> > you implemented any per group congestion kind of interface to make sure
> > such IO's don't sleep if group is congested.
> >
> > Or this limit is per ioband device which every group on the device is
> > sharing. If yes, then how would you provide isolation between groups
> > because if one group consumes io_limit tokens, then other will simply
> > be serialized on that device?
>
> There are two kinds of limits and both limit the number of IO requests
> which can be issued simultaneously, but one is per ioband device,
> the other is per ioband group. The per group limit assigned to
> each group is calculated by dividing io_limit according to their
> proportion of weight.
>
> The kernel thread is not made to sleep by the per group limit, because
> several kinds of kernel threads submit IOs from multiple groups and
> for multiple devices in a single thread. At this time, the kernel
> thread is made to sleep by the per device limit only.
>

Interesting. Actually not blocking kernel threads on per group limit and
instead blocking them only on per device limits sounds like a good idea. I
can also do something similar and that will take away the need of
exporting per group congestion interface to higher layers and reduce
complexity. If some kernel thread does not want to block, these will
continue to use existing per device/bdi congestion interface.

Thanks
Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
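
Editorial sketch of the accounting example quoted in this message (four requests dispatched at t0..t3 and completed at t4..t7). It contrasts the per-request sum (16t) with the wall-clock time during which at least one request was in flight (7t). The names rq_times, summed_service_time and busy_time are hypothetical, and the input is assumed to be sorted by dispatch time. As the reply points out, even the second figure would still include queueing time when measured above the IO scheduler, so this only illustrates the arithmetic, not a complete accounting scheme.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record of one request's dispatch and completion times. */
struct rq_times {
    unsigned long dispatch;
    unsigned long complete;
};

/* Sum of (complete - dispatch) over all requests: overlaps are counted
 * once per request, which is how the example arrives at 16t. */
unsigned long summed_service_time(const struct rq_times *rq, size_t n)
{
    unsigned long total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        assert(rq[i].complete >= rq[i].dispatch);
        total += rq[i].complete - rq[i].dispatch;
    }
    return total;
}

/* Wall-clock time with at least one request in flight: overlapping
 * intervals are merged. Assumes rq[] is sorted by dispatch time. */
unsigned long busy_time(const struct rq_times *rq, size_t n)
{
    unsigned long total = 0, start, end;
    size_t i;

    if (n == 0)
        return 0;

    start = rq[0].dispatch;
    end = rq[0].complete;
    for (i = 1; i < n; i++) {
        if (rq[i].dispatch > end) {
            /* Gap with nothing in flight: close the current interval. */
            total += end - start;
            start = rq[i].dispatch;
            end = rq[i].complete;
        } else if (rq[i].complete > end) {
            end = rq[i].complete;
        }
    }
    return total + (end - start);
}
```

Taking t = 1 and the intervals {0,4}, {1,5}, {2,6}, {3,7}, the first function returns 16 and the second returns 7, matching the figures in the quoted example.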
* Re: IO scheduler based IO controller V10 2009-10-01 6:41 ` Ryo Tsuruta @ 2009-10-01 13:31 ` Vivek Goyal -1 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-01 13:31 UTC (permalink / raw) To: Ryo Tsuruta Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > Vivek Goyal <vgoyal@redhat.com> wrote: > > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote: > > > Hi Vivek, > > > > > > Vivek Goyal <vgoyal@redhat.com> wrote: > > > > I was thinking that elevator layer will do the merge of bios. So IO > > > > scheduler/elevator can time stamp the first bio in the request as it goes > > > > into the disk and again timestamp with finish time once request finishes. > > > > > > > > This way higher layer can get an idea how much disk time a group of bios > > > > used. But on multi queue, if we dispatch say 4 requests from same queue, > > > > then time accounting becomes an issue. > > > > > > > > Consider following where four requests rq1, rq2, rq3 and rq4 are > > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these > > > > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume > > > > time elapsed between each of milestones is t. Also assume that all these > > > > requests are from same queue/group. > > > > > > > > t0 t1 t2 t3 t4 t5 t6 t7 > > > > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4 > > > > > > > > Now higher layer will think that time consumed by group is: > > > > > > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t > > > > > > > > But the time elapsed is only 7t. > > > > > > IO controller can know how many requests are issued and still in > > > progress. Is it not enough to accumulate the time while in-flight IOs > > > exist? 
> > > > > > > That time would not reflect disk time used. It will be follwoing. > > > > (time spent waiting in CFQ queues) + (time spent in dispatch queue) + > > (time spent in disk) > > In the case where multiple IO requests are issued from IO controller, > that time measurement is the time from when the first IO request is > issued until when the endio is called for the last IO request. Does > not it reflect disk time? > Not accurately as it will be including the time spent in CFQ queues as well as dispatch queue. I will not worry much about dispatch queue time but time spent CFQ queues can be significant. This is assuming that you are using token based scheme and will be dispatching requests from multiple groups at the same time. But if you figure out a way that you dispatch requests from one group only at one time and wait for all requests to finish and then let next group go, then above can work fairly accurately. In that case it will become like CFQ with the only difference that effectively we have one queue per group instread of per process. > > > > Secondly if a different group is running only single sequential reader, > > > > there CFQ will be driving queue depth of 1 and time will not be running > > > > faster and this inaccuracy in accounting will lead to unfair share between > > > > groups. > > > > > > > > So we need something better to get a sense which group used how much of > > > > disk time. > > > > > > It could be solved by implementing the way to pass on such information > > > from IO scheduler to higher layer controller. > > > > > > > How would you do that? Can you give some details exactly how and what > > information IO scheduler will pass to higher level IO controller so that IO > > controller can attribute right time to the group. > > If you would like to know when the idle timer is expired, how about > adding a function to IO controller to be notified it from IO > scheduler? IO scheduler calls the function when the timer is expired. 
> This probably can be done. So this is like syncing between lower layers and higher layers about when do we start idling and when do we stop it and both the layers should be in sync. This is something my common layer approach does. Becuase it is so close to IO scheuler, I can do it relatively easily. One probably can create interfaces to even propogate this information up. But this all will probably come into the picture only if we don't use token based schemes and come up with something where at one point of time dispatch are from one group only. > > > > > How about making throttling policy be user selectable like the IO > > > > > scheduler and putting it in the higher layer? So we could support > > > > > all of policies (time-based, size-based and rate limiting). There > > > > > seems not to only one solution which satisfies all users. But I agree > > > > > with starting with proportional bandwidth control first. > > > > > > > > > > > > > What are the cases where time based policy does not work and size based > > > > policy works better and user would choose size based policy and not timed > > > > based one? > > > > > > I think that disk time is not simply proportional to IO size. If there > > > are two groups whose wights are equally assigned and they issue > > > different sized IOs repsectively, the bandwidth of each group would > > > not distributed equally as expected. > > > > > > > If we are providing fairness in terms of time, it is fair. If we provide > > equal time slots to two processes and if one got more IO done because it > > was not wasting time seeking or it issued bigger size IO, it deserves that > > higher BW. IO controller will make sure that process gets fair share in > > terms of time and exactly how much BW one got will depend on the workload. > > > > That's the precise reason that fairness in terms of time is better on > > seeky media. 
> > If the seek time is negligible, the bandwidth would not be distributed > according to a proportion of weight settings. I think that it would be > unclear for users to understand how bandwidth is distributed. And I > also think that seeky media would gradually become obsolete, > I can understand that if lesser the seek cost game starts changing and probably a size based policy also work decently. In that case at some point of time probably CFQ will also need to support another mode/policy where fairness is provided in terms of size of IO, if it detects a SSD with hardware queuing. Currently it seem to be disabling the idling in that case. But this is not very good from fairness point of view. I guess if CFQ wants to provide fairness in such cases, it needs to dynamically change the shape and start thinking in terms of size of IO. So far my testing has been very limited to hard disks connected to my computer. I will do some testing on high end enterprise storage and see how much do seek matter and how well both the implementations work. > > > > I am not against implementing things in higher layer as long as we can > > > > ensure tight control on latencies, strong isolation between groups and > > > > not break CFQ's class and ioprio model with-in group. > > > > > > > > > BTW, I will start to reimplement dm-ioband into block layer. > > > > > > > > Can you elaborate little bit on this? > > > > > > bio is grabbed in generic_make_request() and throttled as well as > > > dm-ioband's mechanism. dmsetup command is not necessary any longer. > > > > > > > Ok, so one would not need dm-ioband device now, but same dm-ioband > > throttling policies will apply. So until and unless we figure out a > > better way, the issues I have pointed out will still exists even in > > new implementation. > > Yes, those still exist, but somehow I would like to try to solve them. 
> > > > The default value of io_limit on the previous test was 128 (not 192) > > > which is equall to the default value of nr_request. > > > > Hm..., I used following commands to create two ioband devices. > > > > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none" > > "weight 0 :100" | dmsetup create ioband1 > > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none" > > "weight 0 :100" | dmsetup create ioband2 > > > > Here io_limit value is zero so it should pick default value. Following is > > output of "dmsetup table" command. > > > > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100 > > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100 > > ^^^^ > > IIUC, above number 192 is reflecting io_limit? If yes, then default seems > > to be 192? > > The default vaule has changed since v1.12.0 and increased from 128 to 192. > > > > > I set it up to 256 as you suggested. I still see writer starving reader. I > > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered > > > > writes. > > > > > > O.K. You removed "conv=fdatasync", the new dm-ioband handles > > > sync/async requests separately, and it solves this > > > buffered-write-starves-read problem. I would like to post it soon > > > after doing some more test. > > > > > > > On top of that can you please give some details how increasing the > > > > buffered queue length reduces the impact of writers? > > > > > > When the number of in-flight IOs exceeds io_limit, processes which are > > > going to issue IOs are made sleep by dm-ioband until all the in-flight > > > IOs are finished. But IO scheduler layer can accept IO requests more > > > than the value of io_limit, so it was a bottleneck of the throughput. > > > > > > > Ok, so it should have been throughput bottleneck but how did it solve the > > issue of writer starving the reader as you had mentioned in the mail. 
>
> As wrote above, I modified dm-ioband to handle sync/async requests
> separately, so even if writers do a lot of buffered IOs, readers can
> issue IOs regardless writers' busyness. Once the IOs are backlogged
> for throttling, the both sync and async requests are issued according
> to the order of arrival.
>

Ok, so if both the readers and writers are buffered and some tokens
become available, will these tokens be divided half and half between the
reader and writer queues?

> > Secondly, you mentioned that processes are made to sleep once we cross
> > io_limit. This sounds like request descriptor facility on request queue
> > where processes are made to sleep.
> >
> > There are threads in kernel which don't want to sleep while submitting
> > bios. For example, btrfs has bio submitting thread which does not want
> > to sleep hence it checks with device if it is congested or not and not
> > submit the bio if it is congested. How would you handle such cases. Have
> > you implemented any per group congestion kind of interface to make sure
> > such IO's don't sleep if group is congested.
> >
> > Or this limit is per ioband device which every group on the device is
> > sharing. If yes, then how would you provide isolation between groups
> > because if one group consumes io_limit tokens, then other will simply
> > be serialized on that device?
>
> There are two kind of limit and both limit the number of IO requests
> which can be issued simultaneously, but one is for per ioband device,
> the other is for per ioband group. The per group limit assigned to
> each group is calculated by dividing io_limit according to their
> proportion of weight.
>
> The kernel thread is not made to sleep by the per group limit, because
> several kinds of kernel threads submit IOs from multiple groups and
> for multiple devices in a single thread. At this time, the kernel
> thread is made to sleep by the per device limit only.
>

Interesting. Actually, not blocking kernel threads on the per group limit
and instead blocking them only on the per device limit sounds like a good
idea.

I can also do something similar, and that will take away the need to
export a per group congestion interface to higher layers and reduce
complexity. If some kernel thread does not want to block, it will
continue to use the existing per device/bdi congestion interface.

Thanks
Vivek
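
The two limits described above can be sketched in a few lines of user-space Python. This is only an illustration of the arithmetic and of the "kernel threads sleep on the per device limit only" rule as stated in the mail, not dm-ioband's actual code. The function names, the group names, the rounding (integer division with a floor of one slot) and the inclusive comparison are all assumptions of this sketch.

```python
def per_group_limits(io_limit, weights):
    """Split a per-device in-flight limit among groups by weight.

    Hypothetical helper: the rounding and the one-slot minimum are
    assumptions of this sketch, not taken from dm-ioband.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    return {name: max(1, io_limit * w // total) for name, w in weights.items()}


def must_wait(dev_inflight, grp_inflight, io_limit, grp_limit, is_kernel_thread):
    """Decide whether a submitter has to sleep before issuing an IO.

    Assumption: a limit counts as reached when in-flight >= limit.
    """
    if dev_inflight >= io_limit:
        return True   # the per-device limit applies to every submitter
    if is_kernel_thread:
        return False  # kernel threads are not held by the per-group limit
    return grp_inflight >= grp_limit


limits = per_group_limits(192, {"grp1": 100, "grp2": 300})
print(limits)  # {'grp1': 48, 'grp2': 144}

# grp1 is at its own limit but the device is not full:
print(must_wait(100, 48, 192, limits["grp1"], is_kernel_thread=False))  # True
print(must_wait(100, 48, 192, limits["grp1"], is_kernel_thread=True))   # False
```

With equal weights, as in the dmsetup commands quoted earlier, each of two groups would get half of io_limit under this model.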
* Re: IO scheduler based IO controller V10 2009-10-01 13:31 ` Vivek Goyal @ 2009-10-02 2:57 ` Vivek Goyal -1 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-02 2:57 UTC (permalink / raw) To: Ryo Tsuruta Cc: nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, m-ikeda, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya On Thu, Oct 01, 2009 at 09:31:09AM -0400, Vivek Goyal wrote: > On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote: > > Hi Vivek, > > > > Vivek Goyal <vgoyal@redhat.com> wrote: > > > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote: > > > > Hi Vivek, > > > > > > > > Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > I was thinking that elevator layer will do the merge of bios. So IO > > > > > scheduler/elevator can time stamp the first bio in the request as it goes > > > > > into the disk and again timestamp with finish time once request finishes. > > > > > > > > > > This way higher layer can get an idea how much disk time a group of bios > > > > > used. But on multi queue, if we dispatch say 4 requests from same queue, > > > > > then time accounting becomes an issue. > > > > > > > > > > Consider following where four requests rq1, rq2, rq3 and rq4 are > > > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these > > > > > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume > > > > > time elapsed between each of milestones is t. Also assume that all these > > > > > requests are from same queue/group. > > > > > > > > > > t0 t1 t2 t3 t4 t5 t6 t7 > > > > > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4 > > > > > > > > > > Now higher layer will think that time consumed by group is: > > > > > > > > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t > > > > > > > > > > But the time elapsed is only 7t. 
> > > >
> > > > IO controller can know how many requests are issued and still in
> > > > progress. Is it not enough to accumulate the time while in-flight IOs
> > > > exist?
> > > >
> > >
> > > That time would not reflect disk time used. It will be following.
> > >
> > > (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> > > (time spent in disk)
> >
> > In the case where multiple IO requests are issued from IO controller,
> > that time measurement is the time from when the first IO request is
> > issued until when the endio is called for the last IO request. Does
> > not it reflect disk time?
> >
>
> Not accurately as it will be including the time spent in CFQ queues as
> well as dispatch queue. I will not worry much about dispatch queue time
> but time spent CFQ queues can be significant.
>
> This is assuming that you are using token based scheme and will be
> dispatching requests from multiple groups at the same time.
>

Thinking more about it...

Does time based fairness make sense at higher level logical devices?

- Time based fairness generally helps with rotational devices, which have
  high seek costs. At a higher level we don't even know the nature of the
  underlying device where the IO will ultimately go.

- For time based fairness to work accurately at a higher level, it will
  most likely require dispatching from a single group at a time, waiting
  for that group's requests to complete, and only then dispatching from
  the next group. Something like the CFQ model of queues.

  Dispatching from a single queue/group works well with a single
  underlying device where CFQ is operating. But a higher level device
  typically has multiple physical devices under it, so it might not make
  sense there, as it makes things more linear and reduces parallel
  processing further.

  So dispatching from a single group at a time, and waiting before we
  dispatch from the next group, will most likely be a killer for
  throughput in higher level devices and might not make sense.
If we don't adopt the policy of dispatching from a single group, then we
run into all the issues of weak isolation between groups, higher
latencies, preemptions across groups etc.

The more I think about the whole issue and the desired set of
requirements, the more I am convinced that we probably need two IO
controlling mechanisms: one which focuses purely on providing bandwidth
fairness numbers on high level devices, and another which works at low
level devices with CFQ and provides good bandwidth shaping, strong
isolation, fairness within a group and good control on latencies.

The higher level controller will not worry about time based policies. It
can implement max bw and proportional bw control based on size of IO and
number of IOs.

The lower level controller at CFQ level will implement time based group
scheduling. Keeping it at the low level has the advantage of better
utilization of hardware in various dm/md configurations (as no throttling
takes place at the higher level), but at the cost of not so strict
fairness numbers at the higher level. So those who want strict fairness
number policies at higher level devices, irrespective of the
shortcomings, can use the higher level controller. Others can stick to
the lower level controller.

For buffered write control we anyway have to either do something in the
memory controller or come up with another cgroup controller which
throttles IO before it goes into the cache. Or, in fact, we can have a
re-look at Andrea Righi's controller, which provided max BW and throttled
buffered writes before they got into the page cache, and try to provide
proportional BW there as well.

Basically I see the space for two IO controllers. At the moment I can't
think of a way of coming up with a single controller which satisfies all
the requirements. So instead provide two and let the user choose one
based on his need.

Any thoughts?

Before finishing this mail, I will throw a wacky idea in the ring. I was
going through the request based dm-multipath paper. Would it make sense
to implement request based dm-ioband?
So basically we implement all the group scheduling in CFQ and let
dm-ioband implement a request function to take the request and break it
back into bios. This way we can keep all the group control in one place
and also meet most of the requirements.

So request based dm-ioband will have a request in hand once that request
has passed group control and prio control. Because dm-ioband is a device
mapper target, one can put it on higher level devices (practically taking
CFQ to the higher level device) and provide fairness there. One can also
put it on those SSDs which don't use an IO scheduler (this is kind of
forcing them to use the IO scheduler).

I am sure there will be many issues, but one big issue I can think of is
that CFQ thinks there is one device beneath it and dispatches requests
from one queue (in case of idling). That would kill parallelism at the
higher layer, and throughput will suffer on many of the dm/md
configurations.

Thanks
Vivek

> But if you figure out a way that you dispatch requests from one group only
> at one time and wait for all requests to finish and then let next group
> go, then above can work fairly accurately. In that case it will become
> like CFQ with the only difference that effectively we have one queue per
> group instead of per process.
>
> > > > > Secondly if a different group is running only single sequential reader,
> > > > > there CFQ will be driving queue depth of 1 and time will not be running
> > > > > faster and this inaccuracy in accounting will lead to unfair share between
> > > > > groups.
> > > > >
> > > > > So we need something better to get a sense which group used how much of
> > > > > disk time.
> > > >
> > > > It could be solved by implementing the way to pass on such information
> > > > from IO scheduler to higher layer controller.
> > > >
> > >
> > > How would you do that?
Can you give some details exactly how and what > > > information IO scheduler will pass to higher level IO controller so that IO > > > controller can attribute right time to the group. > > > > If you would like to know when the idle timer is expired, how about > > adding a function to IO controller to be notified it from IO > > scheduler? IO scheduler calls the function when the timer is expired. > > > > This probably can be done. So this is like syncing between lower layers > and higher layers about when do we start idling and when do we stop it and > both the layers should be in sync. > > This is something my common layer approach does. Becuase it is so close to > IO scheuler, I can do it relatively easily. > > One probably can create interfaces to even propogate this information up. > But this all will probably come into the picture only if we don't use > token based schemes and come up with something where at one point of time > dispatch are from one group only. > > > > > > > How about making throttling policy be user selectable like the IO > > > > > > scheduler and putting it in the higher layer? So we could support > > > > > > all of policies (time-based, size-based and rate limiting). There > > > > > > seems not to only one solution which satisfies all users. But I agree > > > > > > with starting with proportional bandwidth control first. > > > > > > > > > > > > > > > > What are the cases where time based policy does not work and size based > > > > > policy works better and user would choose size based policy and not timed > > > > > based one? > > > > > > > > I think that disk time is not simply proportional to IO size. If there > > > > are two groups whose wights are equally assigned and they issue > > > > different sized IOs repsectively, the bandwidth of each group would > > > > not distributed equally as expected. > > > > > > > > > > If we are providing fairness in terms of time, it is fair. 
If we provide > > > equal time slots to two processes and if one got more IO done because it > > > was not wasting time seeking or it issued bigger size IO, it deserves that > > > higher BW. IO controller will make sure that process gets fair share in > > > terms of time and exactly how much BW one got will depend on the workload. > > > > > > That's the precise reason that fairness in terms of time is better on > > > seeky media. > > > > If the seek time is negligible, the bandwidth would not be distributed > > according to a proportion of weight settings. I think that it would be > > unclear for users to understand how bandwidth is distributed. And I > > also think that seeky media would gradually become obsolete, > > > > I can understand that if lesser the seek cost game starts changing and > probably a size based policy also work decently. > > In that case at some point of time probably CFQ will also need to support > another mode/policy where fairness is provided in terms of size of IO, if > it detects a SSD with hardware queuing. Currently it seem to be disabling > the idling in that case. But this is not very good from fairness point of > view. I guess if CFQ wants to provide fairness in such cases, it needs to > dynamically change the shape and start thinking in terms of size of IO. > > So far my testing has been very limited to hard disks connected to my > computer. I will do some testing on high end enterprise storage and see > how much do seek matter and how well both the implementations work. > > > > > > I am not against implementing things in higher layer as long as we can > > > > > ensure tight control on latencies, strong isolation between groups and > > > > > not break CFQ's class and ioprio model with-in group. > > > > > > > > > > > BTW, I will start to reimplement dm-ioband into block layer. > > > > > > > > > > Can you elaborate little bit on this? 
> > > > > > > > bio is grabbed in generic_make_request() and throttled as well as > > > > dm-ioband's mechanism. dmsetup command is not necessary any longer. > > > > > > > > > > Ok, so one would not need dm-ioband device now, but same dm-ioband > > > throttling policies will apply. So until and unless we figure out a > > > better way, the issues I have pointed out will still exists even in > > > new implementation. > > > > Yes, those still exist, but somehow I would like to try to solve them. > > > > > > The default value of io_limit on the previous test was 128 (not 192) > > > > which is equall to the default value of nr_request. > > > > > > Hm..., I used following commands to create two ioband devices. > > > > > > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none" > > > "weight 0 :100" | dmsetup create ioband1 > > > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none" > > > "weight 0 :100" | dmsetup create ioband2 > > > > > > Here io_limit value is zero so it should pick default value. Following is > > > output of "dmsetup table" command. > > > > > > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100 > > > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100 > > > ^^^^ > > > IIUC, above number 192 is reflecting io_limit? If yes, then default seems > > > to be 192? > > > > The default vaule has changed since v1.12.0 and increased from 128 to 192. > > > > > > > I set it up to 256 as you suggested. I still see writer starving reader. I > > > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered > > > > > writes. > > > > > > > > O.K. You removed "conv=fdatasync", the new dm-ioband handles > > > > sync/async requests separately, and it solves this > > > > buffered-write-starves-read problem. I would like to post it soon > > > > after doing some more test. 
> > > > > > > > > On top of that can you please give some details how increasing the > > > > > buffered queue length reduces the impact of writers? > > > > > > > > When the number of in-flight IOs exceeds io_limit, processes which are > > > > going to issue IOs are made sleep by dm-ioband until all the in-flight > > > > IOs are finished. But IO scheduler layer can accept IO requests more > > > > than the value of io_limit, so it was a bottleneck of the throughput. > > > > > > > > > > Ok, so it should have been throughput bottleneck but how did it solve the > > > issue of writer starving the reader as you had mentioned in the mail. > > > > As wrote above, I modified dm-ioband to handle sync/async requests > > separately, so even if writers do a lot of buffered IOs, readers can > > issue IOs regardless writers' busyness. Once the IOs are backlogged > > for throttling, the both sync and async requests are issued according > > to the other of arrival. > > > > Ok, so if both the readers and writers are buffered and some tokens become > available then these tokens will be divided half and half between readers > or writer queues? > > > > Secondly, you mentioned that processes are made to sleep once we cross > > > io_limit. This sounds like request descriptor facility on requeust queue > > > where processes are made to sleep. > > > > > > There are threads in kernel which don't want to sleep while submitting > > > bios. For example, btrfs has bio submitting thread which does not want > > > to sleep hence it checks with device if it is congested or not and not > > > submit the bio if it is congested. How would you handle such cases. Have > > > you implemented any per group congestion kind of interface to make sure > > > such IO's don't sleep if group is congested. > > > > > > Or this limit is per ioband device which every group on the device is > > > sharing. 
If yes, then how would you provide isolation between groups
> > > because if one group consumes io_limit tokens, then other will simply
> > > be serialized on that device?
> >
> > There are two kind of limit and both limit the number of IO requests
> > which can be issued simultaneously, but one is for per ioband device,
> > the other is for per ioband group. The per group limit assigned to
> > each group is calculated by dividing io_limit according to their
> > proportion of weight.
> >
> > The kernel thread is not made to sleep by the per group limit, because
> > several kinds of kernel threads submit IOs from multiple groups and
> > for multiple devices in a single thread. At this time, the kernel
> > thread is made to sleep by the per device limit only.
> >
>
> Interesting. Actually not blocking kernel threads on per group limit
> and instead blocking it only on per device limits sounds like a good idea.
>
> I can also do something similar and that will take away the need of
> exporting per group congestion interface to higher layers and reduce
> complexity. If some kernel thread does not want to block, these will
> continue to use existing per device/bdi congestion interface.
>
> Thanks
> Vivek
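
The accounting example quoted in this message (four overlapping requests, 16t charged versus 7t elapsed) can be reproduced with a small sketch. It is only a user-space model: the function names are made up for illustration, and requests are represented as hypothetical (dispatch, completion) timestamp pairs. The first function sums per-request latencies, which over-counts when requests overlap. The second measures the time during which at least one request was in flight, which is the alternative suggested in the thread.

```python
def summed_latency(requests):
    """Charge each request its own (completion - dispatch) time."""
    return sum(done - start for start, done in requests)


def busy_time(requests):
    """Total time during which at least one request was in flight
    (length of the union of the [dispatch, completion] intervals)."""
    total = 0
    cur_start = cur_end = None
    for start, done in sorted(requests):
        if cur_end is None or start > cur_end:
            # Gap before this request: close the previous busy period.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, done
        else:
            # Overlaps the current busy period: extend it.
            cur_end = max(cur_end, done)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


t = 1  # one unit of time between consecutive milestones
# rq1..rq4 dispatched at t0..t3 and completed at t4..t7
reqs = [(0 * t, 4 * t), (1 * t, 5 * t), (2 * t, 6 * t), (3 * t, 7 * t)]
print(summed_latency(reqs))  # 16
print(busy_time(reqs))       # 7
```

Note that the busy-time figure only removes the double counting. As pointed out in the reply above, if it is measured from above the IO scheduler while several groups are dispatching at once, it still includes time spent waiting in CFQ queues, so it is not pure disk time.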
* Re: IO scheduler based IO controller V10 @ 2009-10-02 2:57 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-02 2:57 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir, paolo.valente, jmarchan, guijianfeng, fernando, mikew, yoshikawa.takuya, jmoyer, nauman, mingo, m-ikeda, riel, lizf, fchecconi, s-uchida, containers, linux-kernel, akpm, righi.andrea, torvalds On Thu, Oct 01, 2009 at 09:31:09AM -0400, Vivek Goyal wrote: > On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote: > > Hi Vivek, > > > > Vivek Goyal <vgoyal@redhat.com> wrote: > > > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote: > > > > Hi Vivek, > > > > > > > > Vivek Goyal <vgoyal@redhat.com> wrote: > > > > > I was thinking that elevator layer will do the merge of bios. So IO > > > > > scheduler/elevator can time stamp the first bio in the request as it goes > > > > > into the disk and again timestamp with finish time once request finishes. > > > > > > > > > > This way higher layer can get an idea how much disk time a group of bios > > > > > used. But on multi queue, if we dispatch say 4 requests from same queue, > > > > > then time accounting becomes an issue. > > > > > > > > > > Consider following where four requests rq1, rq2, rq3 and rq4 are > > > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these > > > > > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume > > > > > time elapsed between each of milestones is t. Also assume that all these > > > > > requests are from same queue/group. > > > > > > > > > > t0 t1 t2 t3 t4 t5 t6 t7 > > > > > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4 > > > > > > > > > > Now higher layer will think that time consumed by group is: > > > > > > > > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t > > > > > > > > > > But the time elapsed is only 7t. 
> > > > > > > > IO controller can know how many requests are issued and still in > > > > progress. Is it not enough to accumulate the time while in-flight IOs > > > > exist? > > > > > > > > > > That time would not reflect disk time used. It will be following. > > > > > > (time spent waiting in CFQ queues) + (time spent in dispatch queue) + > > > (time spent in disk) > > > > In the case where multiple IO requests are issued from IO controller, > > that time measurement is the time from when the first IO request is > > issued until when the endio is called for the last IO request. Does > > not it reflect disk time? > > > > Not accurately as it will be including the time spent in CFQ queues as > well as dispatch queue. I will not worry much about dispatch queue time > but time spent CFQ queues can be significant. > > This is assuming that you are using token based scheme and will be > dispatching requests from multiple groups at the same time. >

Thinking more about it...

Does time-based fairness make sense at higher-level logical devices?

- Time-based fairness generally helps with rotational devices, which have high seek costs. At a higher level we don't even know the nature of the underlying device where the IO will ultimately go.

- For time-based fairness to work accurately at a higher level, it will most likely require dispatching from a single group at a time, waiting for that group's requests to complete, and only then dispatching from the next group. Something like the CFQ model of queues.

Dispatching from a single queue/group works well with a single underlying device where CFQ is operating. But higher-level devices typically have multiple physical devices under them, so it might not make sense there: it makes things more linear and reduces parallel processing further.

So dispatching from a single group at a time, and waiting before we dispatch from the next group, will most likely be a killer for throughput on higher-level devices and might not make sense.

If we don't adopt the policy of dispatching from a single group, then we run into all the issues of weak isolation between groups, higher latencies, preemptions across groups, etc.

The more I think about the whole issue and the desired set of requirements, the more I am convinced that we probably need two IO controlling mechanisms: one which focuses purely on providing bandwidth fairness numbers on high-level devices, and another which works at low-level devices with CFQ and provides good bandwidth shaping, strong isolation, fairness within a group, and good control over latencies.

The higher-level controller will not worry about time-based policies. It can implement max bw and proportional bw control based on the size of IO and the number of IOs.

The lower-level controller at the CFQ level will implement time-based group scheduling. Keeping it at the low level has the advantage of better utilization of hardware in various dm/md configurations (as no throttling takes place at the higher level), but at the cost of not-so-strict fairness numbers at the higher level. So those who want strict fairness number policies at higher-level devices, irrespective of the shortcomings, can use the higher-level controller. Others can stick to the lower-level controller.

For buffered write control we anyway have to either do something in the memory controller or come up with another cgroup controller which throttles IO before it goes into the cache. Or, in fact, we can have a re-look at Andrea Righi's controller, which provided max BW and throttled buffered writes before they got into the page cache, and try to provide proportional BW there as well.

Basically I see space for two IO controllers. At the moment I can't think of a way of coming up with a single controller which satisfies all the requirements. So instead provide two and let the user choose one based on his need.

Any thoughts?

Before finishing this mail, I will throw a whacky idea in the ring. I was going through the request-based dm-multipath paper. Will it make sense to implement request-based dm-ioband? So basically we implement all the group scheduling in CFQ and let dm-ioband implement a request function to take the request and break it back into bios. This way we can keep all the group control in one place and also meet most of the requirements.

So request-based dm-ioband will have a request in hand once that request has passed group control and prio control. Because dm-ioband is a device mapper target, one can put it on higher-level devices (practically taking CFQ to the higher-level device) and provide fairness there. One can also put it on those SSDs which don't use an IO scheduler (this is kind of forcing them to use the IO scheduler).

I am sure there will be many issues, but one big issue I can think of is that CFQ thinks there is one device beneath it and dispatches requests from one queue (in case of idling). That would kill parallelism at the higher layer, and throughput will suffer on many of the dm/md configurations.

Thanks
Vivek

> But if you figure out a way that you dispatch requests from one group only > at one time and wait for all requests to finish and then let next group > go, then above can work fairly accurately. In that case it will become > like CFQ with the only difference that effectively we have one queue per > group instead of per process. > > > > > > Secondly if a different group is running only single sequential reader, > > > > > there CFQ will be driving queue depth of 1 and time will not be running > > > > > faster and this inaccuracy in accounting will lead to unfair share between > > > > > groups. > > > > > > > > > > So we need something better to get a sense which group used how much of > > > > > disk time. > > > > > > > > It could be solved by implementing the way to pass on such information > > > > from IO scheduler to higher layer controller. > > > > > > > > > > How would you do that?
Can you give some details exactly how and what > > > information IO scheduler will pass to higher level IO controller so that IO > > > controller can attribute right time to the group. > > > > If you would like to know when the idle timer is expired, how about > > adding a function to IO controller to be notified it from IO > > scheduler? IO scheduler calls the function when the timer is expired. > > > > This probably can be done. So this is like syncing between lower layers > and higher layers about when do we start idling and when do we stop it and > both the layers should be in sync. > > This is something my common layer approach does. Becuase it is so close to > IO scheuler, I can do it relatively easily. > > One probably can create interfaces to even propogate this information up. > But this all will probably come into the picture only if we don't use > token based schemes and come up with something where at one point of time > dispatch are from one group only. > > > > > > > How about making throttling policy be user selectable like the IO > > > > > > scheduler and putting it in the higher layer? So we could support > > > > > > all of policies (time-based, size-based and rate limiting). There > > > > > > seems not to only one solution which satisfies all users. But I agree > > > > > > with starting with proportional bandwidth control first. > > > > > > > > > > > > > > > > What are the cases where time based policy does not work and size based > > > > > policy works better and user would choose size based policy and not timed > > > > > based one? > > > > > > > > I think that disk time is not simply proportional to IO size. If there > > > > are two groups whose wights are equally assigned and they issue > > > > different sized IOs repsectively, the bandwidth of each group would > > > > not distributed equally as expected. > > > > > > > > > > If we are providing fairness in terms of time, it is fair. 
If we provide > > > equal time slots to two processes and if one got more IO done because it > > > was not wasting time seeking or it issued bigger size IO, it deserves that > > > higher BW. IO controller will make sure that process gets fair share in > > > terms of time and exactly how much BW one got will depend on the workload. > > > > > > That's the precise reason that fairness in terms of time is better on > > > seeky media. > > > > If the seek time is negligible, the bandwidth would not be distributed > > according to a proportion of weight settings. I think that it would be > > unclear for users to understand how bandwidth is distributed. And I > > also think that seeky media would gradually become obsolete, > > > > I can understand that if lesser the seek cost game starts changing and > probably a size based policy also work decently. > > In that case at some point of time probably CFQ will also need to support > another mode/policy where fairness is provided in terms of size of IO, if > it detects a SSD with hardware queuing. Currently it seem to be disabling > the idling in that case. But this is not very good from fairness point of > view. I guess if CFQ wants to provide fairness in such cases, it needs to > dynamically change the shape and start thinking in terms of size of IO. > > So far my testing has been very limited to hard disks connected to my > computer. I will do some testing on high end enterprise storage and see > how much do seek matter and how well both the implementations work. > > > > > > I am not against implementing things in higher layer as long as we can > > > > > ensure tight control on latencies, strong isolation between groups and > > > > > not break CFQ's class and ioprio model with-in group. > > > > > > > > > > > BTW, I will start to reimplement dm-ioband into block layer. > > > > > > > > > > Can you elaborate little bit on this? 
> > > > > > > > bio is grabbed in generic_make_request() and throttled as well as > > > > dm-ioband's mechanism. dmsetup command is not necessary any longer. > > > > > > > > > > Ok, so one would not need dm-ioband device now, but same dm-ioband > > > throttling policies will apply. So until and unless we figure out a > > > better way, the issues I have pointed out will still exists even in > > > new implementation. > > > > Yes, those still exist, but somehow I would like to try to solve them. > > > > > > The default value of io_limit on the previous test was 128 (not 192) > > > > which is equall to the default value of nr_request. > > > > > > Hm..., I used following commands to create two ioband devices. > > > > > > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none" > > > "weight 0 :100" | dmsetup create ioband1 > > > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none" > > > "weight 0 :100" | dmsetup create ioband2 > > > > > > Here io_limit value is zero so it should pick default value. Following is > > > output of "dmsetup table" command. > > > > > > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100 > > > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100 > > > ^^^^ > > > IIUC, above number 192 is reflecting io_limit? If yes, then default seems > > > to be 192? > > > > The default vaule has changed since v1.12.0 and increased from 128 to 192. > > > > > > > I set it up to 256 as you suggested. I still see writer starving reader. I > > > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered > > > > > writes. > > > > > > > > O.K. You removed "conv=fdatasync", the new dm-ioband handles > > > > sync/async requests separately, and it solves this > > > > buffered-write-starves-read problem. I would like to post it soon > > > > after doing some more test. 
> > > > > > > > > On top of that can you please give some details how increasing the > > > > > buffered queue length reduces the impact of writers? > > > > > > > > When the number of in-flight IOs exceeds io_limit, processes which are > > > > going to issue IOs are made sleep by dm-ioband until all the in-flight > > > > IOs are finished. But IO scheduler layer can accept IO requests more > > > > than the value of io_limit, so it was a bottleneck of the throughput. > > > > > > > > > > Ok, so it should have been throughput bottleneck but how did it solve the > > > issue of writer starving the reader as you had mentioned in the mail. > > > > As wrote above, I modified dm-ioband to handle sync/async requests > > separately, so even if writers do a lot of buffered IOs, readers can > > issue IOs regardless writers' busyness. Once the IOs are backlogged > > for throttling, the both sync and async requests are issued according > > to the other of arrival. > > > > Ok, so if both the readers and writers are buffered and some tokens become > available then these tokens will be divided half and half between readers > or writer queues? > > > > Secondly, you mentioned that processes are made to sleep once we cross > > > io_limit. This sounds like request descriptor facility on requeust queue > > > where processes are made to sleep. > > > > > > There are threads in kernel which don't want to sleep while submitting > > > bios. For example, btrfs has bio submitting thread which does not want > > > to sleep hence it checks with device if it is congested or not and not > > > submit the bio if it is congested. How would you handle such cases. Have > > > you implemented any per group congestion kind of interface to make sure > > > such IO's don't sleep if group is congested. > > > > > > Or this limit is per ioband device which every group on the device is > > > sharing. 
If yes, then how would you provide isolation between groups > > > because if one group consumes io_limit tokens, then the other will simply > > > be serialized on that device? > > > > There are two kinds of limit and both limit the number of IO requests > > which can be issued simultaneously, but one is for per ioband device, > > the other is for per ioband group. The per group limit assigned to > > each group is calculated by dividing io_limit according to their > > proportion of weight. > > > > The kernel thread is not made to sleep by the per group limit, because > > several kinds of kernel threads submit IOs from multiple groups and > > for multiple devices in a single thread. At this time, the kernel > > thread is made to sleep by the per device limit only. > > > > Interesting. Actually not blocking kernel threads on per group limit > and instead blocking them only on per device limits sounds like a good idea. > > I can also do something similar and that will take away the need of > exporting per group congestion interface to higher layers and reduce > complexity. If some kernel thread does not want to block, these will > continue to use existing per device/bdi congestion interface. > > Thanks > Vivek ^ permalink raw reply [flat|nested] 349+ messages in thread
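The rq1..rq4 accounting example quoted in this message can be checked with a small user-space sketch. It compares the naive per-request sum (which gives 16t) with the wall-clock time during which at least one request was in flight (7t). The struct and function names are made up for illustration; nothing here is kernel code.

```c
#include <stddef.h>

/* Hypothetical record of one request's lifetime at the disk. */
struct rq_times {
    unsigned long dispatch;  /* time the request was sent to the disk */
    unsigned long complete;  /* time the request finished */
};

/* Naive accounting: add up (complete - dispatch) for every request. */
static unsigned long summed_time(const struct rq_times *rq, size_t n)
{
    unsigned long total = 0;

    for (size_t i = 0; i < n; i++)
        total += rq[i].complete - rq[i].dispatch;
    return total;
}

/*
 * Wall-clock accounting: total length of the periods during which at
 * least one request was in flight. Assumes rq[] is sorted by dispatch
 * time (an assumption of this sketch).
 */
static unsigned long busy_time(const struct rq_times *rq, size_t n)
{
    unsigned long total = 0, start, end;

    if (n == 0)
        return 0;

    start = rq[0].dispatch;
    end = rq[0].complete;
    for (size_t i = 1; i < n; i++) {
        if (rq[i].dispatch > end) {
            /* Gap with nothing in flight: close the current period. */
            total += end - start;
            start = rq[i].dispatch;
            end = rq[i].complete;
        } else if (rq[i].complete > end) {
            end = rq[i].complete;
        }
    }
    return total + (end - start);
}
```

With t = 1 and requests dispatched at 0, 1, 2, 3 and completed at 4, 5, 6, 7, `summed_time()` returns 16 while `busy_time()` returns 7. The second figure corresponds to accumulating time while in-flight IOs exist; as the reply notes, when measured above the IO scheduler it would still include time spent waiting in CFQ and dispatch queues.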
* Re: IO scheduler based IO controller V10 2009-10-02 2:57 ` Vivek Goyal @ 2009-10-02 20:27 ` Munehiro Ikeda -1 siblings, 0 replies; 349+ messages in thread From: Munehiro Ikeda @ 2009-10-02 20:27 UTC (permalink / raw) To: Vivek Goyal Cc: Ryo Tsuruta, nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya Vivek Goyal wrote, on 10/01/2009 10:57 PM: > Before finishing this mail, will throw a whacky idea in the ring. I was > going through the request based dm-multipath paper. Will it make sense > to implement request based dm-ioband? So basically we implement all the > group scheduling in CFQ and let dm-ioband implement a request function > to take the request and break it back into bios. This way we can keep > all the group control at one place and also meet most of the requirements. > > So request based dm-ioband will have a request in hand once that request > has passed group control and prio control. Because dm-ioband is a device > mapper target, one can put it on higher level devices (practically taking > CFQ at higher level device), and provide fairness there. One can also > put it on those SSDs which don't use IO scheduler (this is kind of forcing > them to use the IO scheduler.) > > I am sure that will be many issues but one big issue I could think of that > CFQ thinks that there is one device beneath it and dispatches requests > from one queue (in case of idling) and that would kill parallelism at > higher layer and throughput will suffer on many of the dm/md configurations. > > Thanks > Vivek

As long as CFQ is used, your idea sounds reasonable to me. But how about other IO schedulers? In my understanding, one of the keys to guaranteeing group isolation in your patch is to have per-group IO scheduler internal queues even with the as, deadline, and noop schedulers. I think this is a great idea, and implementing generic code for all IO schedulers was the conclusion reached when we had so many IO-scheduler-specific proposals.

If we still need per-group IO scheduler internal queues with request-based dm-ioband, we have to modify the elevator layer. That seems out of scope for dm.

I might be missing something...

-- IKEDA, Munehiro NEC Corporation of America m-ikeda@ds.jp.nec.com ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <4AC6623F.70600-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <4AC6623F.70600-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org> @ 2009-10-05 10:38 ` Ryo Tsuruta 0 siblings, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-10-05 10:38 UTC (permalink / raw) To: m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Hi, Munehiro Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org> wrote: > Vivek Goyal wrote, on 10/01/2009 10:57 PM: > > Before finishing this mail, will throw a whacky idea in the ring. I was > > going through the request based dm-multipath paper. Will it make sense > > to implement request based dm-ioband? So basically we implement all the > > group scheduling in CFQ and let dm-ioband implement a request function > > to take the request and break it back into bios. This way we can keep > > all the group control at one place and also meet most of the requirements. > > > > So request based dm-ioband will have a request in hand once that request > > has passed group control and prio control. Because dm-ioband is a device > > mapper target, one can put it on higher level devices (practically taking > > CFQ at higher level device), and provide fairness there. One can also > > put it on those SSDs which don't use IO scheduler (this is kind of forcing > > them to use the IO scheduler.) 
> > > > I am sure that will be many issues but one big issue I could think of that > > CFQ thinks that there is one device beneath it and dispatches requests > > from one queue (in case of idling) and that would kill parallelism at > > higher layer and throughput will suffer on many of the dm/md configurations. > > > > Thanks > > Vivek > > As long as using CFQ, your idea is reasonable for me. But how about for > other IO schedulers? In my understanding, one of the keys to guarantee > group isolation in your patch is to have per-group IO scheduler internal > queue even with as, deadline, and noop scheduler. I think this is > great idea, and to implement generic code for all IO schedulers was > concluded when we had so many IO scheduler specific proposals. > If we will still need per-group IO scheduler internal queues with > request-based dm-ioband, we have to modify elevator layer. It seems > out of scope of dm. > I might miss something...

IIUC, the request-based device-mapper cannot break a request back into bios, so it could not work with block devices which don't use the IO scheduler.

How about adding a callback function to the higher-level controller? CFQ calls it when the active queue runs out of time, and the higher-level controller uses it as a trigger or a hint to move to another IO group. That way, I think a time-based controller could be implemented at the higher level.

My requirements for an IO controller are:
- Implemented as a higher-level controller, which is located at the block layer, with bios grabbed in generic_make_request().
- Can work with any type of IO scheduler.
- Can work with any type of block device.
- Supports multiple policies: proportional weight, max rate, time based, and so on.

The IO controller mini-summit will be held next week, and I'm looking forward to meeting you all and discussing the IO controller.
https://sourceforge.net/apps/trac/ioband/wiki/iosummit

Thanks,
Ryo Tsuruta ^ permalink raw reply [flat|nested] 349+ messages in thread
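The callback idea above could look roughly like the following user-space sketch. Every name here (`slice_notifier`, `io_groups`, `scheduler_expire_slice`, and so on) is hypothetical and exists only for illustration; no such interface is known to exist in CFQ or dm-ioband. Plain round-robin is also an assumption, standing in for whatever policy the controller would use to "move" to another group.

```c
#include <stddef.h>

/* Hypothetical hook: called by the IO scheduler when the active
 * queue's time slice runs out. */
typedef void (*slice_expired_fn)(void *data);

struct slice_notifier {
    slice_expired_fn fn;  /* callback registered by the controller */
    void *data;           /* controller-private state */
};

/* Hypothetical higher-level controller state. */
struct io_groups {
    unsigned int nr_groups;  /* number of IO groups */
    unsigned int active;     /* index of the group allowed to dispatch */
};

/* Controller side: treat slice expiry as the trigger to switch groups.
 * Round-robin is used here purely as a placeholder policy. */
static void controller_slice_expired(void *data)
{
    struct io_groups *g = data;

    if (g->nr_groups)
        g->active = (g->active + 1) % g->nr_groups;
}

/* Scheduler side: stand-in for the point where CFQ would expire its
 * active queue and notify whoever registered a hook. */
static void scheduler_expire_slice(const struct slice_notifier *n)
{
    if (n && n->fn)
        n->fn(n->data);
}
```

A controller would fill in a `slice_notifier` with `controller_slice_expired` and its own `io_groups`, and the scheduler would call `scheduler_expire_slice()` at each slice expiry. Note that this design lets only one group dispatch at a time.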
[parent not found: <20091005.193808.104033719.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091005.193808.104033719.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> @ 2009-10-05 12:31 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-05 12:31 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote: > Hi, > > Munehiro Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org> wrote: > > Vivek Goyal wrote, on 10/01/2009 10:57 PM: > > > Before finishing this mail, will throw a whacky idea in the ring. I was > > > going through the request based dm-multipath paper. Will it make sense > > > to implement request based dm-ioband? So basically we implement all the > > > group scheduling in CFQ and let dm-ioband implement a request function > > > to take the request and break it back into bios. This way we can keep > > > all the group control at one place and also meet most of the requirements. > > > > > > So request based dm-ioband will have a request in hand once that request > > > has passed group control and prio control. Because dm-ioband is a device > > > mapper target, one can put it on higher level devices (practically taking > > > CFQ at higher level device), and provide fairness there. One can also > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing > > > them to use the IO scheduler.) 
> > > > > > I am sure that will be many issues but one big issue I could think of that > > > CFQ thinks that there is one device beneath it and dispatches requests > > > from one queue (in case of idling) and that would kill parallelism at > > > higher layer and throughput will suffer on many of the dm/md configurations. > > > > > > Thanks > > > Vivek > > > > As long as using CFQ, your idea is reasonable for me. But how about for > > other IO schedulers? In my understanding, one of the keys to guarantee > > group isolation in your patch is to have per-group IO scheduler internal > > queue even with as, deadline, and noop scheduler. I think this is > > great idea, and to implement generic code for all IO schedulers was > > concluded when we had so many IO scheduler specific proposals. > > If we will still need per-group IO scheduler internal queues with > > request-based dm-ioband, we have to modify elevator layer. It seems > > out of scope of dm. > > I might miss something... > > IIUC, the request based device-mapper could not break back a request > into bio, so it could not work with block devices which don't use the > IO scheduler. > I think current request based multipath driver does not do it but can't it be implemented that requests are broken back into bio? Anyway, I don't feel too strongly about this approach as it might introduce more serialization at higher layer. > How about adding a callback function to the higher level controller? > CFQ calls it when the active queue runs out of time, then the higher > level controller use it as a trigger or a hint to move IO group, so > I think a time-based controller could be implemented at higher level. > Adding a call back should not be a big issue. But that means you are planning to run only one group at higher layer at one time and I think that's the problem because then we are introducing serialization at higher layer.
So any higher level device mapper target which has multiple physical disks under it, we might be underutilizing these even more and take a big hit on overall throughput. The whole design of doing proportional weight at lower layer is optimal usage of system. > My requirements for IO controller are: > - Implement as a higher level controller, which is located at block > layer and bio is grabbed in generic_make_request(). How are you planning to handle the issue of buffered writes Andrew raised? > - Can work with any type of IO scheduler. > - Can work with any type of block devices. > - Support multiple policies, proportional weight, max rate, time > based, and so on. > > The IO controller mini-summit will be held in next week, and I'm > looking forward to meet you all and discuss about IO controller. > https://sourceforge.net/apps/trac/ioband/wiki/iosummit Is there a new version of dm-ioband now where you have solved the issue of sync/async dispatch with-in group? Before meeting at mini-summit, I am trying to run some tests and come up with numbers so that we have more clear picture of pros/cons. Thanks Vivek ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-05 12:31 ` Vivek Goyal @ 2009-10-05 14:55 ` Ryo Tsuruta -1 siblings, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-10-05 14:55 UTC (permalink / raw) To: vgoyal Cc: m-ikeda, nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya Hi Vivek, Vivek Goyal <vgoyal@redhat.com> wrote: > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote: > > Hi, > > > > Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote: > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM: > > > > Before finishing this mail, will throw a whacky idea in the ring. I was > > > > going through the request based dm-multipath paper. Will it make sense > > > > to implement request based dm-ioband? So basically we implement all the > > > > group scheduling in CFQ and let dm-ioband implement a request function > > > > to take the request and break it back into bios. This way we can keep > > > > all the group control at one place and also meet most of the requirements. > > > > > > > > So request based dm-ioband will have a request in hand once that request > > > > has passed group control and prio control. Because dm-ioband is a device > > > > mapper target, one can put it on higher level devices (practically taking > > > > CFQ at higher level device), and provide fairness there. One can also > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing > > > > them to use the IO scheduler.) > > > > > > > > I am sure that will be many issues but one big issue I could think of that > > > > CFQ thinks that there is one device beneath it and dispatches requests > > > > from one queue (in case of idling) and that would kill parallelism at > > > > higher layer and throughput will suffer on many of the dm/md configurations.
> > > > > > > > Thanks > > > > Vivek > > > > > > As long as using CFQ, your idea is reasonable for me. But how about for > > > other IO schedulers? In my understanding, one of the keys to guarantee > > > group isolation in your patch is to have per-group IO scheduler internal > > > queue even with as, deadline, and noop scheduler. I think this is > > > great idea, and to implement generic code for all IO schedulers was > > > concluded when we had so many IO scheduler specific proposals. > > > If we will still need per-group IO scheduler internal queues with > > > request-based dm-ioband, we have to modify elevator layer. It seems > > > out of scope of dm. > > > I might miss something... > > > > IIUC, the request based device-mapper could not break back a request > > into bio, so it could not work with block devices which don't use the > > IO scheduler. > > > > I think current request based multipath driver does not do it but can't it > be implemented that requests are broken back into bio? I guess it would be hard to implement it, and we need to hold requests and throttle them there and it would break the ordering by CFQ. > Anyway, I don't feel too strongly about this approach as it might > introduce more serialization at higher layer. Yes, I know it. > > How about adding a callback function to the higher level controller? > > CFQ calls it when the active queue runs out of time, then the higher > > level controller use it as a trigger or a hint to move IO group, so > > I think a time-based controller could be implemented at higher level. > > > > Adding a call back should not be a big issue. But that means you are > planning to run only one group at higher layer at one time and I think > that's the problem because then we are introducing serialization at higher > layer. So any higher level device mapper target which has multiple > physical disks under it, we might be underutilizing these even more and > take a big hit on overall throughput.
> > The whole design of doing proportional weight at lower layer is optimal > usage of system. But I think that the higher level approach makes it easy to configure against striped software raid devices. If one would like to combine some physical disks into one logical device like a dm-linear, I think one should map the IO controller on each physical device and combine them into one logical device. > > My requirements for IO controller are: > > - Implement as a higher level controller, which is located at block > > layer and bio is grabbed in generic_make_request(). > > How are you planning to handle the issue of buffered writes Andrew raised? I think that it would be better to use the higher-level controller along with the memory controller and limit memory usage for each cgroup. And as Kamezawa-san said, having limits of dirty pages would be better, too. > > - Can work with any type of IO scheduler. > > - Can work with any type of block devices. > > - Support multiple policies, proportional weight, max rate, time > > based, and so on. > > > > The IO controller mini-summit will be held in next week, and I'm > > looking forward to meet you all and discuss about IO controller. > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit > > Is there a new version of dm-ioband now where you have solved the issue of > sync/async dispatch with-in group? Before meeting at mini-summit, I am > trying to run some tests and come up with numbers so that we have more > clear picture of pros/cons. Yes, I've released new versions of dm-ioband and blkio-cgroup. The new dm-ioband handles sync/async IO requests separately and the write-starve-read issue you pointed out is fixed. I would appreciate it if you would try them. http://sourceforge.net/projects/ioband/files/ Thanks, Ryo Tsuruta ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 2009-10-05 14:55 ` Ryo Tsuruta @ 2009-10-05 17:10 ` Vivek Goyal -1 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-05 17:10 UTC (permalink / raw) To: Ryo Tsuruta Cc: m-ikeda, nauman, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > Vivek Goyal <vgoyal@redhat.com> wrote: > > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote: > > > Hi, > > > > > > Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote: > > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM: > > > > > Before finishing this mail, will throw a whacky idea in the ring. I was > > > > > going through the request based dm-multipath paper. Will it make sense > > > > > to implement request based dm-ioband? So basically we implement all the > > > > > group scheduling in CFQ and let dm-ioband implement a request function > > > > > to take the request and break it back into bios. This way we can keep > > > > > all the group control at one place and also meet most of the requirements. > > > > > > > > > > So request based dm-ioband will have a request in hand once that request > > > > > has passed group control and prio control. Because dm-ioband is a device > > > > > mapper target, one can put it on higher level devices (practically taking > > > > > CFQ at higher level device), and provide fairness there. One can also > > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing > > > > > them to use the IO scheduler.) 
> > > > > > > > > > I am sure that will be many issues but one big issue I could think of that > > > > > CFQ thinks that there is one device beneath it and dispatches requests > > > > > from one queue (in case of idling) and that would kill parallelism at > > > > > higher layer and throughput will suffer on many of the dm/md configurations. > > > > > > > > > > Thanks > > > > > Vivek > > > > > > > > As long as using CFQ, your idea is reasonable for me. But how about for > > > > other IO schedulers? In my understanding, one of the keys to guarantee > > > > group isolation in your patch is to have per-group IO scheduler internal > > > > queue even with as, deadline, and noop scheduler. I think this is > > > > great idea, and to implement generic code for all IO schedulers was > > > > concluded when we had so many IO scheduler specific proposals. > > > > If we will still need per-group IO scheduler internal queues with > > > > request-based dm-ioband, we have to modify elevator layer. It seems > > > > out of scope of dm. > > > > I might miss something... > > > > > > IIUC, the request based device-mapper could not break back a request > > > into bio, so it could not work with block devices which don't use the > > > IO scheduler. > > > > > > > I think current request based multipath driver does not do it but can't it > > be implemented that requests are broken back into bio? > > I guess it would be hard to implement it, and we need to hold requests > and throttle them there and it would break the ordering by CFQ. > > > Anyway, I don't feel too strongly about this approach as it might > > introduce more serialization at higher layer. > > Yes, I know it. > > > > How about adding a callback function to the higher level controller? > > > CFQ calls it when the active queue runs out of time, then the higher > > > level controller use it as a trigger or a hint to move IO group, so > > > I think a time-based controller could be implemented at higher level.
> > > > > > > Adding a call back should not be a big issue. But that means you are > > planning to run only one group at higher layer at one time and I think > > that's the problem because then we are introducing serialization at higher > > layer. So any higher level device mapper target which has multiple > > physical disks under it, we might be underutilizing these even more and > > take a big hit on overall throughput. > > > > The whole design of doing proportional weight at lower layer is optimal > > usage of system. > > But I think that the higher level approach makes it easy to configure > against striped software raid devices. How does it make it easier to configure in case of higher level controller? In case of lower level design, one just has to create cgroups and assign weights to cgroups. This minimum step will be required in higher level controller also. (Even if you get rid of dm-ioband device setup step). > If one would like to > combine some physical disks into one logical device like a dm-linear, > I think one should map the IO controller on each physical device and > combine them into one logical device. > In fact this sounds like a more complicated step where one has to setup one dm-ioband device on top of each physical device. But I am assuming that this will go away once you move to per request queue like implementation. I think it should be same in principle as my initial implementation of IO controller on request queue and I stopped development on it because of FIFO dispatch. So you seem to be suggesting that you will move dm-ioband to request queue so that setting up additional device setup is gone. You will also enable it to do time based groups policy, so that we don't run into issues on seeky media. Will also enable dispatch from one group only at a time so that we don't run into isolation issues and can do time accounting accurately. If yes, then that has the potential to solve the issue.
At higher layer one can think of enabling size of IO/number of IO policy both for proportional BW and max BW type of control. At lower level one can enable pure time based control on seeky media. I think this will still be left with the issue of prio with-in group as group control is separate and you will not be maintaining separate queues for each process. Similarly you will also have issues with read vs write ratios as IO schedulers underneath change. So I will be curious to see that implementation. > > > My requirements for IO controller are: > > > - Implement as a higher level controller, which is located at block > > > layer and bio is grabbed in generic_make_request(). > > > > How are you planning to handle the issue of buffered writes Andrew raised? > > I think that it would be better to use the higher-level controller > along with the memory controller and limit memory usage for each > cgroup. And as Kamezawa-san said, having limits of dirty pages would > be better, too. > Ok. So if we plan to co-mount memory controller with per memory group dirty_ratio implemented, that can work with both higher level as well as low level controller. Not sure if we also require some kind of a per memory group flusher thread infrastructure also to make sure higher weight group gets more job done. > > > - Can work with any type of IO scheduler. > > > - Can work with any type of block devices. > > > - Support multiple policies, proportional weight, max rate, time > > > based, and so on. > > > > > > The IO controller mini-summit will be held in next week, and I'm > > > looking forward to meet you all and discuss about IO controller. > > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit > > > > Is there a new version of dm-ioband now where you have solved the issue of > > sync/async dispatch with-in group? Before meeting at mini-summit, I am > > trying to run some tests and come up with numbers so that we have more > > clear picture of pros/cons.
> > Yes, I've released new versions of dm-ioband and blkio-cgroup. The new > dm-ioband handles sync/async IO requests separately and > the write-starve-read issue you pointed out is fixed. I would > appreciate it if you would try them. > http://sourceforge.net/projects/ioband/files/ Cool. Will get to testing it. Thanks Vivek ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-10-05 17:10 ` Vivek Goyal
@ 2009-10-05 18:11 ` Nauman Rafique
  -1 siblings, 0 replies; 349+ messages in thread
From: Nauman Rafique @ 2009-10-05 18:11 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Ryo Tsuruta, m-ikeda, linux-kernel, jens.axboe, containers,
      dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
      s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea,
      agk, akpm, peterz, jmarchan, torvalds, mingo, riel,
      yoshikawa.takuya

On Mon, Oct 5, 2009 at 10:10 AM, Vivek Goyal <vgoyal@redhat.com> wrote:
> On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote:
>> Hi Vivek,
>>
>> Vivek Goyal <vgoyal@redhat.com> wrote:
>> > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote:
>> > > Hi,
>> > >
>> > > Munehiro Ikeda <m-ikeda@ds.jp.nec.com> wrote:
>> > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM:
>> > > > > Before finishing this mail, will throw a whacky idea in the ring. I was
>> > > > > going through the request based dm-multipath paper. Will it make sense
>> > > > > to implement request based dm-ioband? So basically we implement all the
>> > > > > group scheduling in CFQ and let dm-ioband implement a request function
>> > > > > to take the request and break it back into bios. This way we can keep
>> > > > > all the group control at one place and also meet most of the requirements.
>> > > > >
>> > > > > So request based dm-ioband will have a request in hand once that request
>> > > > > has passed group control and prio control. Because dm-ioband is a device
>> > > > > mapper target, one can put it on higher level devices (practically taking
>> > > > > CFQ at higher level device), and provide fairness there. One can also
>> > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing
>> > > > > them to use the IO scheduler.)
>> > > > >
>> > > > > I am sure there will be many issues but one big issue I could think of is that
>> > > > > CFQ thinks that there is one device beneath it and dispatches requests
>> > > > > from one queue (in case of idling) and that would kill parallelism at
>> > > > > higher layer and throughput will suffer on many of the dm/md configurations.
>> > > > >
>> > > > > Thanks
>> > > > > Vivek
>> > > >
>> > > > As long as using CFQ, your idea is reasonable for me. But how about for
>> > > > other IO schedulers? In my understanding, one of the keys to guarantee
>> > > > group isolation in your patch is to have per-group IO scheduler internal
>> > > > queue even with as, deadline, and noop scheduler. I think this is
>> > > > great idea, and to implement generic code for all IO schedulers was
>> > > > concluded when we had so many IO scheduler specific proposals.
>> > > > If we will still need per-group IO scheduler internal queues with
>> > > > request-based dm-ioband, we have to modify elevator layer. It seems
>> > > > out of scope of dm.
>> > > > I might miss something...
>> > >
>> > > IIUC, the request based device-mapper could not break back a request
>> > > into bio, so it could not work with block devices which don't use the
>> > > IO scheduler.
>> > >
>> >
>> > I think current request based multipath driver does not do it but can't it
>> > be implemented that requests are broken back into bio?
>>
>> I guess it would be hard to implement it, and we need to hold requests
>> and throttle them at there and it would break the ordering by CFQ.
>>
>> > Anyway, I don't feel too strongly about this approach as it might
>> > introduce more serialization at higher layer.
>>
>> Yes, I know it.
>>
>> > > How about adding a callback function to the higher level controller?
>> > > CFQ calls it when the active queue runs out of time, then the higher
>> > > level controller use it as a trigger or a hint to move IO group, so
>> > > I think a time-based controller could be implemented at higher level.
>> > >
>> >
>> > Adding a call back should not be a big issue. But that means you are
>> > planning to run only one group at higher layer at one time and I think
>> > that's the problem because then we are introducing serialization at higher
>> > layer. So any higher level device mapper target which has multiple
>> > physical disks under it, we might be underutilizing these even more and
>> > take a big hit on overall throughput.
>> >
>> > The whole design of doing proportional weight at lower layer is optimal
>> > usage of system.
>>
>> But I think that the higher level approach makes easy to configure
>> against striped software raid devices.
>
> How does it make easier to configure in case of higher level controller?
>
> In case of lower level design, one just has to create cgroups and assign
> weights to cgroups. This minimum step will be required in higher level
> controller also. (Even if you get rid of dm-ioband device setup step).
>
>> If one would like to
>> combine some physical disks into one logical device like a dm-linear,
>> I think one should map the IO controller on each physical device and
>> combine them into one logical device.
>>
>
> In fact this sounds like a more complicated step where one has to setup
> one dm-ioband device on top of each physical device. But I am assuming
> that this will go away once you move to per request queue like implementation.
>
> I think it should be same in principle as my initial implementation of IO
> controller on request queue and I stopped development on it because of FIFO
> dispatch.
>
> So you seem to be suggesting that you will move dm-ioband to request queue
> so that setting up additional device setup is gone. You will also enable
> it to do time based groups policy, so that we don't run into issues on
> seeky media. Will also enable dispatch from one group only at a time so
> that we don't run into isolation issues and can do time accounting
> accurately.

Will that approach solve the problem of doing bandwidth control on
logical devices? What would be the advantages compared to Vivek's
current patches?

>
> If yes, then that has the potential to solve the issue. At higher layer one
> can think of enabling size of IO/number of IO policy both for proportional
> BW and max BW type of control. At lower level one can enable pure time
> based control on seeky media.
>
> I think this will still be left with the issue of prio within group as group
> control is separate and you will not be maintaining separate queues for
> each process. Similarly you will also have issues with read vs write
> ratios as IO schedulers underneath change.
>
> So I will be curious to see that implementation.
>
>> > > My requirements for IO controller are:
>> > > - Implement as a higher level controller, which is located at block
>> > >   layer and bio is grabbed in generic_make_request().
>> >
>> > How are you planning to handle the issue of buffered writes Andrew raised?
>>
>> I think that it would be better to use the higher-level controller
>> along with the memory controller and have limits on memory usage for each
>> cgroup. And as Kamezawa-san said, having limits of dirty pages would
>> be better, too.
>>
>
> Ok. So if we plan to co-mount memory controller with per memory group
> dirty_ratio implemented, that can work with both higher level as well as
> low level controller. Not sure if we also require some kind of a per
> memory group flusher thread infrastructure also to make sure higher weight
> group gets more job done.
>
>> > > - Can work with any type of IO scheduler.
>> > > - Can work with any type of block devices.
>> > > - Support multiple policies, proportional weight, max rate, time
>> > >   based, and so on.
>> > >
>> > > The IO controller mini-summit will be held in next week, and I'm
>> > > looking forward to meet you all and discuss about IO controller.
>> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
>> >
>> > Is there a new version of dm-ioband now where you have solved the issue of
>> > sync/async dispatch within group? Before meeting at mini-summit, I am
>> > trying to run some tests and come up with numbers so that we have more
>> > clear picture of pros/cons.
>>
>> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
>> dm-ioband handles sync/async IO requests separately and
>> the write-starve-read issue you pointed out is fixed. I would
>> appreciate it if you would try them.
>> http://sourceforge.net/projects/ioband/files/
>
> Cool. Will get to testing it.
>
> Thanks
> Vivek
>

^ permalink raw reply	[flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  2009-10-05 18:11 ` Nauman Rafique
@ 2009-10-06  7:17 ` Ryo Tsuruta
  -1 siblings, 0 replies; 349+ messages in thread
From: Ryo Tsuruta @ 2009-10-06 7:17 UTC
To: nauman
Cc: vgoyal, m-ikeda, linux-kernel, jens.axboe, containers, dm-devel,
    dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida,
    taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, agk, akpm,
    peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

Hi Vivek and Nauman,

Nauman Rafique <nauman@google.com> wrote:
> >> > > How about adding a callback function to the higher level controller?
> >> > > CFQ calls it when the active queue runs out of time, then the higher
> >> > > level controller uses it as a trigger or a hint to move IO group, so
> >> > > I think a time-based controller could be implemented at higher level.
> >> > >
> >> >
> >> > Adding a callback should not be a big issue. But that means you are
> >> > planning to run only one group at higher layer at one time and I think
> >> > that's the problem because then we are introducing serialization at higher
> >> > layer. So any higher level device mapper target which has multiple
> >> > physical disks under it, we might be underutilizing these even more and
> >> > take a big hit on overall throughput.
> >> >
> >> > The whole design of doing proportional weight at lower layer is optimal
> >> > usage of system.
> >>
> >> But I think that the higher level approach makes it easy to configure
> >> against striped software raid devices.
> >
> > How does it make it easier to configure in case of higher level controller?
> >
> > In case of lower level design, one just has to create cgroups and assign
> > weights to cgroups. This minimum step will be required in higher level
> > controller also. (Even if you get rid of dm-ioband device setup step).

In the case of the lower level controller, if we need to assign weights on
a per device basis, we have to assign weights to all the devices of which
a raid device consists, but in the case of the higher level controller,
we just assign weights to the raid device only.

> >> If one would like to
> >> combine some physical disks into one logical device like a dm-linear,
> >> I think one should map the IO controller on each physical device and
> >> combine them into one logical device.
> >>
> >
> > In fact this sounds like a more complicated step where one has to setup
> > one dm-ioband device on top of each physical device. But I am assuming
> > that this will go away once you move to per request queue like implementation.

I don't understand why the per request queue implementation makes it
go away. If dm-ioband is integrated into the LVM tools, it could allow
users to skip the complicated steps to configure dm-linear devices.

> > I think it should be same in principle as my initial implementation of IO
> > controller on request queue and I stopped development on it because of FIFO
> > dispatch.

I think that FIFO dispatch seldom leads to priority inversion, because
the holding period for throttling is not long enough to break the IO
priority. I did some tests to see whether priority inversion happens.

The first test ran fio sequential readers on the same group. The BE0
reader got the highest throughput as I expected.

                nr_threads      16      |     16     |     1
                ionice          BE7     |     BE7    |    BE0
                ------------------------+------------+-------------
                vanilla     10,076KiB/s | 9,779KiB/s | 32,775KiB/s
                ioband       9,576KiB/s | 9,367KiB/s | 34,154KiB/s

The second test ran fio sequential readers on two different groups and
gave weights of 20 and 10 to each group respectively. The bandwidth
was distributed according to their weights and the BE0 reader got
higher throughput than the BE7 readers in the same group. IO priority
was preserved within the IO group.

                group         group1    |         group2
                weight          20      |           10
                ------------------------+--------------------------
                nr_threads      16      |     16     |     1
                ionice          BE7     |     BE7    |    BE0
                ------------------------+--------------------------
                ioband      27,513KiB/s | 3,524KiB/s | 10,248KiB/s
                                        |       Total = 13,772KiB/s

Here is my test script.
-------------------------------------------------------------------------
arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
     --group_reporting"

sync
echo 3 > /proc/sys/vm/drop_caches

echo $$ > /cgroup/1/tasks
ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
echo $$ > /cgroup/2/tasks
ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
echo $$ > /cgroup/tasks
wait
-------------------------------------------------------------------------

Be that as it may, I think that if every bio can point to the iocontext
of the process, then it becomes possible to handle IO priority in the
higher level controller. A patchset has already been posted by Takahashi-san.
What do you think about this idea?

  Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
  Subject [RFC][PATCH 1/10] I/O context inheritance
  From Hirokazu Takahashi <>
  http://lkml.org/lkml/2008/4/22/195

> > So you seem to be suggesting that you will move dm-ioband to request queue
> > so that setting up additional device setup is gone. You will also enable
> > it to do time based groups policy, so that we don't run into issues on
> > seeky media. Will also enable dispatch from one group only at a time so
> > that we don't run into isolation issues and can do time accounting
> > accurately.
>
> Will that approach solve the problem of doing bandwidth control on
> logical devices? What would be the advantages compared to Vivek's
> current patches?

I will only move the point where dm-ioband grabs bios; dm-ioband's other
mechanisms and functionality will still be the same.
The advantages over scheduler based controllers are:
 - can work with any type of block devices
 - can work with any type of IO scheduler and needs no big change.

> > If yes, then that has the potential to solve the issue. At higher layer one
> > can think of enabling size of IO/number of IO policy both for proportional
> > BW and max BW type of control. At lower level one can enable pure time
> > based control on seeky media.
> >
> > I think this will still be left with the issue of prio with-in group as group
> > control is separate and you will not be maintaining separate queues for
> > each process. Similarly you will also have issues with read vs write
> > ratios as IO schedulers underneath change.
> >
> > So I will be curious to see that implementation.
> >
> >> > > My requirements for IO controller are:
> >> > > - Implement as a higher level controller, which is located at block
> >> > >   layer and bio is grabbed in generic_make_request().
> >> >
> >> > How are you planning to handle the issue of buffered writes Andrew raised?
> >>
> >> I think that it would be better to use the higher-level controller
> >> along with the memory controller and have limits on memory usage for each
> >> cgroup. And as Kamezawa-san said, having limits of dirty pages would
> >> be better, too.
> >>
> >
> > Ok. So if we plan to co-mount memory controller with per memory group
> > dirty_ratio implemented, that can work with both higher level as well as
> > low level controller. Not sure if we also require some kind of a per
> > memory group flusher thread infrastructure also to make sure higher weight
> > group gets more job done.

I'm not sure either that a per memory group flusher is necessary.
And we have to consider not only pdflush but also other threads which
issue IOs from multiple groups.

> >> > > - Can work with any type of IO scheduler.
> >> > > - Can work with any type of block devices.
> >> > > - Support multiple policies, proportional weight, max rate, time
> >> > >   based, and so on.
> >> > >
> >> > > The IO controller mini-summit will be held next week, and I'm
> >> > > looking forward to meeting you all and discussing the IO controller.
> >> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> >> >
> >> > Is there a new version of dm-ioband now where you have solved the issue of
> >> > sync/async dispatch with-in group? Before meeting at mini-summit, I am
> >> > trying to run some tests and come up with numbers so that we have more
> >> > clear picture of pros/cons.
> >>
> >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> >> dm-ioband handles sync/async IO requests separately and
> >> the write-starve-read issue you pointed out is fixed. I would
> >> appreciate it if you would try them.
> >> http://sourceforge.net/projects/ioband/files/
> >
> > Cool. Will get to testing it.

Thanks for your help in advance.

Thanks,
Ryo Tsuruta
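As a quick sanity check on the second test in the message above, the observed group totals can be compared against the configured 20:10 weights. The sketch below only redoes the arithmetic on the figures quoted in the table; the helper name `bw_ratio` is an assumption made for illustration and is not part of dm-ioband or fio.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): ratio of two bandwidth
# figures in KiB/s, printed with two decimals.
bw_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

group1=27513                  # group1: 16 BE7 readers (weight 20)
group2=$((3524 + 10248))      # group2: 16 BE7 readers + 1 BE0 reader (weight 10)

echo "group2 total:                 ${group2} KiB/s"
echo "observed group1:group2 ratio: $(bw_ratio "$group1" "$group2")"
echo "configured weight ratio:      $(bw_ratio 20 10)"
echo "BE0:BE7 ratio inside group2:  $(bw_ratio 10248 3524)"
```

The observed split tracks the configured weights closely, and within group2 the single BE0 reader still gets more bandwidth than the sixteen BE7 readers combined, which is the point the message makes about priority being preserved for this sequential-read workload.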
* Re: IO scheduler based IO controller V10
  2009-10-06  7:17 ` Ryo Tsuruta
@ 2009-10-06 11:22 ` Vivek Goyal
  -1 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-10-06 11:22 UTC
To: Ryo Tsuruta
Cc: nauman, m-ikeda, linux-kernel, jens.axboe, containers, dm-devel,
    dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida,
    taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, agk, akpm,
    peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

On Tue, Oct 06, 2009 at 04:17:44PM +0900, Ryo Tsuruta wrote:
> Hi Vivek and Nauman,
>
> Nauman Rafique <nauman@google.com> wrote:
> > >> > > How about adding a callback function to the higher level controller?
> > >> > > CFQ calls it when the active queue runs out of time, then the higher
> > >> > > level controller uses it as a trigger or a hint to move IO group, so
> > >> > > I think a time-based controller could be implemented at higher level.
> > >> > >
> > >> >
> > >> > Adding a callback should not be a big issue. But that means you are
> > >> > planning to run only one group at higher layer at one time and I think
> > >> > that's the problem because then we are introducing serialization at higher
> > >> > layer. So any higher level device mapper target which has multiple
> > >> > physical disks under it, we might be underutilizing these even more and
> > >> > take a big hit on overall throughput.
> > >> >
> > >> > The whole design of doing proportional weight at lower layer is optimal
> > >> > usage of system.
> > >>
> > >> But I think that the higher level approach makes it easy to configure
> > >> against striped software raid devices.
> > >
> > > How does it make it easier to configure in case of higher level controller?
> > >
> > > In case of lower level design, one just has to create cgroups and assign
> > > weights to cgroups. This minimum step will be required in higher level
> > > controller also. (Even if you get rid of dm-ioband device setup step).
>
> In the case of the lower level controller, if we need to assign weights on
> a per device basis, we have to assign weights to all the devices of which
> a raid device consists, but in the case of the higher level controller,
> we just assign weights to the raid device only.
>

This is required only if you need to assign different weights to different
devices. This is just an additional facility and not a requirement. Normally
you will not be required to do that and devices will inherit the cgroup
weights automatically. So one has to only assign the cgroup weights.

> > >> If one would like to
> > >> combine some physical disks into one logical device like a dm-linear,
> > >> I think one should map the IO controller on each physical device and
> > >> combine them into one logical device.
> > >>
> > >
> > > In fact this sounds like a more complicated step where one has to setup
> > > one dm-ioband device on top of each physical device. But I am assuming
> > > that this will go away once you move to per request queue like implementation.
>
> I don't understand why the per request queue implementation makes it
> go away. If dm-ioband is integrated into the LVM tools, it could allow
> users to skip the complicated steps to configure dm-linear devices.
>

Those who are not using dm-tools will be forced to use dm-tools for
bandwidth control features.

> > > I think it should be same in principle as my initial implementation of IO
> > > controller on request queue and I stopped development on it because of FIFO
> > > dispatch.
>
> I think that FIFO dispatch seldom leads to priority inversion, because
> the holding period for throttling is not long enough to break the IO
> priority. I did some tests to see whether priority inversion happens.
>
> The first test ran fio sequential readers on the same group. The BE0
> reader got the highest throughput as I expected.
>
>                 nr_threads      16      |     16     |     1
>                 ionice          BE7     |     BE7    |    BE0
>                 ------------------------+------------+-------------
>                 vanilla     10,076KiB/s | 9,779KiB/s | 32,775KiB/s
>                 ioband       9,576KiB/s | 9,367KiB/s | 34,154KiB/s
>
> The second test ran fio sequential readers on two different groups and
> gave weights of 20 and 10 to each group respectively. The bandwidth
> was distributed according to their weights and the BE0 reader got
> higher throughput than the BE7 readers in the same group. IO priority
> was preserved within the IO group.
>
>                 group         group1    |         group2
>                 weight          20      |           10
>                 ------------------------+--------------------------
>                 nr_threads      16      |     16     |     1
>                 ionice          BE7     |     BE7    |    BE0
>                 ------------------------+--------------------------
>                 ioband      27,513KiB/s | 3,524KiB/s | 10,248KiB/s
>                                         |       Total = 13,772KiB/s
>

Interesting. In all the test cases you always test with sequential
readers. I have changed the test case a bit (I have already reported the
results in another mail, now running the same test again with dm-version
1.14). I made all the readers do direct IO and in the other group I put
a buffered writer. So the setup looks as follows.

In group1, I launch 1 prio0 reader and an increasing number of prio4
readers. In group2 I just run a dd doing buffered writes. Weights of
both the groups are 100 each.

Following are the results on 2.6.31 kernel.

With-dm-ioband
==============
<------------prio4 readers---------------------->  <---prio0 reader------>
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1    9992KiB/s   9992KiB/s   9992KiB/s   413K usec   4621KiB/s   369K usec
2    4859KiB/s   4265KiB/s   9122KiB/s   344K usec   4915KiB/s   401K usec
4    2238KiB/s   1381KiB/s   7703KiB/s   532K usec   3195KiB/s   546K usec
8     504KiB/s     46KiB/s   1439KiB/s   399K usec   7661KiB/s   220K usec
16    131KiB/s     26KiB/s    638KiB/s   492K usec   4847KiB/s   359K usec

With vanilla CFQ
================
<------------prio4 readers---------------------->  <---prio0 reader------>
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   10779KiB/s  10779KiB/s  10779KiB/s   407K usec  16094KiB/s   808K usec
2    7045KiB/s   6913KiB/s  13959KiB/s   538K usec  18794KiB/s   761K usec
4    7842KiB/s   4409KiB/s  20967KiB/s   876K usec  12543KiB/s   443K usec
8    6198KiB/s   2426KiB/s  24219KiB/s  1469K usec   9483KiB/s   685K usec
16   5041KiB/s   1358KiB/s  27022KiB/s  2417K usec   6211KiB/s  1025K usec

The above results show how bandwidth got distributed between the prio4
and prio0 readers with-in the group as we increased the number of prio4
readers in the group. In the other group a buffered writer is continuously
running as a competitor.

Notice how bandwidth allocation is broken with dm-ioband.

With 1 prio4 reader, the prio4 reader got more bandwidth than the prio0
reader.

With 2 prio4 readers, it looks like prio4 got almost the same BW as prio0.

With 8 and 16 prio4 readers, it looks like the prio0 reader takes over and
the prio4 readers starve. As we increase the number of prio4 readers in the
group, their total aggregate BW share should increase. Instead it is
decreasing.

So to me, in the face of competition with a writer in the other group, BW
is all over the place. Some of these might be dm-ioband bugs and some of
these might be coming from the fact that buffering takes place in the
higher layer and dispatch is FIFO?

> Here is my test script.
> -------------------------------------------------------------------------
> arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
>      --group_reporting"
>
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> echo $$ > /cgroup/1/tasks
> ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> echo $$ > /cgroup/2/tasks
> ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> echo $$ > /cgroup/tasks
> wait
> -------------------------------------------------------------------------
>
> Be that as it may, I think that if every bio can point to the iocontext
> of the process, then it becomes possible to handle IO priority in the
> higher level controller. A patchset has already been posted by Takahashi-san.
> What do you think about this idea?
>
>   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
>   Subject [RFC][PATCH 1/10] I/O context inheritance
>   From Hirokazu Takahashi <>
>   http://lkml.org/lkml/2008/4/22/195

So far you have been denying that there are issues with ioprio with-in
group in the higher level controller. Here you seem to be saying that there
are issues with ioprio and we need to take this patch in to solve the
issue? I am confused.

Anyway, if you think that the above patch is needed to solve the issue of
ioprio in the higher level controller, why are you not posting it as part
of your patch series regularly, so that we can also apply this patch along
with the other patches and test the effects?

>
> > > So you seem to be suggesting that you will move dm-ioband to request queue
> > > so that setting up additional device setup is gone. You will also enable
> > > it to do time based groups policy, so that we don't run into issues on
> > > seeky media. Will also enable dispatch from one group only at a time so
> > > that we don't run into isolation issues and can do time accounting
> > > accurately.
> >
> > Will that approach solve the problem of doing bandwidth control on
> > logical devices? What would be the advantages compared to Vivek's
> > current patches?
>
> I will only move the point where dm-ioband grabs bios; dm-ioband's other
> mechanisms and functionality will still be the same.
> The advantages over scheduler based controllers are:
>  - can work with any type of block devices
>  - can work with any type of IO scheduler and needs no big change.
>

The "no big change" part we will come to know for sure once we have the
implementation for the timed groups done and shown that it works as well
as my patches. There are so many subtle things with the time based
approach.

[..]
> > >> > Is there a new version of dm-ioband now where you have solved the issue of
> > >> > sync/async dispatch with-in group? Before meeting at mini-summit, I am
> > >> > trying to run some tests and come up with numbers so that we have more
> > >> > clear picture of pros/cons.
> > >>
> > >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> > >> dm-ioband handles sync/async IO requests separately and
> > >> the write-starve-read issue you pointed out is fixed. I would
> > >> appreciate it if you would try them.
> > >> http://sourceforge.net/projects/ioband/files/
> > >
> > > Cool. Will get to testing it.
>
> Thanks for your help in advance.

Against what kernel version do the above patches apply? The biocgroup
patches I tried against 2.6.31 as well as 2.6.32-rc1 and they do not apply
cleanly against either of these.

So for the time being I am doing testing with biocgroup patches.

Thanks
Vivek
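The mixed workload in the message above (direct-IO readers at two priorities in one group, a buffered dd writer in the other) is described only in prose. A minimal sketch of how such a run might be laid out follows. It is a dry run that only prints the commands, since the real thing needs root, fio, and mounted cgroups. The mount point, file sizes, job names, runtime, and the helper names `reader_cmd` and `writer_cmd` are all assumptions borrowed from or modelled on the earlier test script, not the script that produced the numbers above.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for each step instead of running them.
MNT=${MNT:-/mnt1}          # assumed test filesystem
RUNTIME=${RUNTIME:-30}     # assumed runtime in seconds

# Print the fio command for $2 direct-IO sequential readers at
# best-effort priority $1 (hypothetical helper).
reader_cmd() {
    prio=$1; njobs=$2
    echo "ionice -c 2 -n $prio fio --name=prio${prio} --rw=read --direct=1" \
         "--time_based --runtime=$RUNTIME --directory=$MNT --size=1024M" \
         "--numjobs=$njobs --group_reporting --output=prio${prio}.log"
}

# Print the buffered-writer command: plain dd through the page cache
# (hypothetical helper; size is an assumption).
writer_cmd() {
    echo "dd if=/dev/zero of=$MNT/writer.img bs=1M count=4096"
}

for n in 1 2 4 8 16; do
    echo "# --- $n prio4 readers ---"
    echo "sync; echo 3 > /proc/sys/vm/drop_caches"
    echo "echo \$\$ > /cgroup/1/tasks"      # group1: readers
    echo "$(reader_cmd 0 1) &"
    echo "$(reader_cmd 4 "$n") &"
    echo "echo \$\$ > /cgroup/2/tasks"      # group2: buffered writer
    echo "$(writer_cmd) &"
    echo "echo \$\$ > /cgroup/tasks"
    echo "wait"
done
```

Keeping the readers on `--direct=1` while the writer goes through the page cache is what distinguishes this setup from the sequential-reader-only tests: the writer's IO reaches the block layer from flusher context rather than from the dd process itself.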
* Re: IO scheduler based IO controller V10 @ 2009-10-06 11:22 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-06 11:22 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir, paolo.valente, jmarchan, guijianfeng, fernando, mikew, yoshikawa.takuya, jmoyer, nauman, mingo, righi.andrea, riel, lizf, fchecconi, s-uchida, containers, linux-kernel, akpm, m-ikeda, torvalds On Tue, Oct 06, 2009 at 04:17:44PM +0900, Ryo Tsuruta wrote: > Hi Vivek and Nauman, > > Nauman Rafique <nauman@google.com> wrote: > > >> > > How about adding a callback function to the higher level controller? > > >> > > CFQ calls it when the active queue runs out of time, then the higer > > >> > > level controller use it as a trigger or a hint to move IO group, so > > >> > > I think a time-based controller could be implemented at higher level. > > >> > > > > >> > > > >> > Adding a call back should not be a big issue. But that means you are > > >> > planning to run only one group at higher layer at one time and I think > > >> > that's the problem because than we are introducing serialization at higher > > >> > layer. So any higher level device mapper target which has multiple > > >> > physical disks under it, we might be underutilizing these even more and > > >> > take a big hit on overall throughput. > > >> > > > >> > The whole design of doing proportional weight at lower layer is optimial > > >> > usage of system. > > >> > > >> But I think that the higher level approch makes easy to configure > > >> against striped software raid devices. > > > > > > How does it make easier to configure in case of higher level controller? > > > > > > In case of lower level design, one just have to create cgroups and assign > > > weights to cgroups. This mininum step will be required in higher level > > > controller also. (Even if you get rid of dm-ioband device setup step). 
> > In the case of lower level controller, if we need to assign weights on > a per device basis, we have to assign weights to all devices of which > a raid device consists, but in the case of higher level controller, > we just assign weights to the raid device only. > This is required only if you need to assign different weights to different devices. This is just additional facility and not a requirement. Normally you will not be required to do that and devices will inherit the cgroup weights automatically. So one has to only assign the cgroup weights. > > >> If one would like to > > >> combine some physical disks into one logical device like a dm-linear, > > >> I think one should map the IO controller on each physical device and > > >> combine them into one logical device. > > >> > > > > > > In fact this sounds like a more complicated step where one has to setup > > > one dm-ioband device on top of each physical device. But I am assuming > > > that this will go away once you move to per reuqest queue like implementation. > > I don't understand why the per request queue implementation makes it > go away. If dm-ioband is integrated into the LVM tools, it could allow > users to skip the complicated steps to configure dm-linear devices. > Those who are not using dm-tools will be forced to use dm-tools for bandwidth control features. > > > I think it should be same in principal as my initial implementation of IO > > > controller on request queue and I stopped development on it because of FIFO > > > dispatch. > > I think that FIFO dispatch seldom lead to prioviry inversion, because > holding period for throttling is not too long to break the IO priority. > I did some tests to see whether priority inversion is happened. > > The first test ran fio sequential readers on the same group. The BE0 > reader got the highest throughput as I expected. 
> > nr_threads 16 | 16 | 1 > ionice BE7 | BE7 | BE0 > ------------------------+------------+------------- > vanilla 10,076KiB/s | 9,779KiB/s | 32,775KiB/s > ioband 9,576KiB/s | 9,367KiB/s | 34,154KiB/s > > The second test ran fio sequential readers on two different groups and > give weights of 20 and 10 to each group respectively. The bandwidth > was distributed according to their weights and the BE0 reader got > higher throughput than the BE7 readers in the same group. IO priority > was preserved within the IO group. > > group group1 | group2 > weight 20 | 10 > ------------------------+-------------------------- > nr_threads 16 | 16 | 1 > ionice BE7 | BE7 | BE0 > ------------------------+-------------------------- > ioband 27,513KiB/s | 3,524KiB/s | 10,248KiB/s > | Total = 13,772KiB/s > Interesting. In all the test cases you always test with sequential readers. I have changed the test case a bit (I have already reported the results in another mail, now running the same test again with dm-version 1.14). I made all the readers doing direct IO and in other group I put a buffered writer. So setup looks as follows. In group1, I launch 1 prio 0 reader and increasing number of prio4 readers. In group 2 I just run a dd doing buffered writes. Weights of both the groups are 100 each. Following are the results on 2.6.31 kernel. 

With-dm-ioband
==============
    <------------prio4 readers----------------------> <---prio0 reader------>
nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
1    9992KiB/s    9992KiB/s    9992KiB/s    413K usec    4621KiB/s    369K usec
2    4859KiB/s    4265KiB/s    9122KiB/s    344K usec    4915KiB/s    401K usec
4    2238KiB/s    1381KiB/s    7703KiB/s    532K usec    3195KiB/s    546K usec
8     504KiB/s      46KiB/s    1439KiB/s    399K usec    7661KiB/s    220K usec
16    131KiB/s      26KiB/s     638KiB/s    492K usec    4847KiB/s    359K usec

With vanilla CFQ
================
    <------------prio4 readers----------------------> <---prio0 reader------>
nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
1   10779KiB/s   10779KiB/s   10779KiB/s    407K usec   16094KiB/s    808K usec
2    7045KiB/s    6913KiB/s   13959KiB/s    538K usec   18794KiB/s    761K usec
4    7842KiB/s    4409KiB/s   20967KiB/s    876K usec   12543KiB/s    443K usec
8    6198KiB/s    2426KiB/s   24219KiB/s   1469K usec    9483KiB/s    685K usec
16   5041KiB/s    1358KiB/s   27022KiB/s   2417K usec    6211KiB/s   1025K usec

Above results are showing how bandwidth got distributed between prio4 and
prio0 readers with-in group as we increased number of prio4 readers in
the group. In another group a buffered writer is continuously going on
as competitor.

Notice, with dm-ioband how bandwidth allocation is broken.

With 1 prio4 reader, prio4 reader got more bandwidth than prio0 reader.

With 2 prio4 readers, looks like prio4 got almost same BW as prio0.

With 8 and 16 prio4 readers, looks like prio0 reader takes over and prio4
readers starve.

As we increase number of prio4 readers in the group, their total aggregate
BW share should increase. Instead it is decreasing.

So to me in the face of competition with a writer in other group, BW is
all over the place. Some of these might be dm-ioband bugs and some of
these might be coming from the fact that buffering takes place in higher
layer and dispatch is FIFO?

> Here is my test script.
> -------------------------------------------------------------------------
> arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
>      --group_reporting"
>
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> echo $$ > /cgroup/1/tasks
> ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> echo $$ > /cgroup/2/tasks
> ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> echo $$ > /cgroup/tasks
> wait
> -------------------------------------------------------------------------
>
> Be that as it may, I think that if every bio can point the iocontext
> of the process, then it makes it possible to handle IO priority in the
> higher level controller. A patchset has already been posted by Takahashi-san.
> What do you think about this idea?
>
>   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
>   Subject [RFC][PATCH 1/10] I/O context inheritance
>   From Hirokazu Takahashi <>
>   http://lkml.org/lkml/2008/4/22/195

So far you have been denying that there are issues with ioprio with-in
group in higher level controller. Here you seem to be saying that there are
issues with ioprio and we need to take this patch in to solve the issue? I am
confused?

Anyway, if you think that above patch is needed to solve the issue of
ioprio in higher level controller, why are you not posting it as part of
your patch series regularly, so that we can also apply this patch along
with other patches and test the effects?

>
> > > So you seem to be suggesting that you will move dm-ioband to request queue
> > > so that setting up additional device setup is gone. You will also enable
> > > it to do time based groups policy, so that we don't run into issues on
> > > seeky media. Will also enable dispatch from one group only at a time so
> > > that we don't run into isolation issues and can do time accounting
> > > accurately.
> >
> > Will that approach solve the problem of doing bandwidth control on
> > logical devices? What would be the advantages compared to Vivek's
> > current patches?
>
> I will only move the point where dm-ioband grabs bios, other
> dm-ioband's mechanism and functionality will still be the same.
> The advantages against to scheduler based controllers are:
>  - can work with any type of block devices
>  - can work with any type of IO scheduler and no need a big change.
>

The big change thing we will come to know for sure when we have
implementation for the timed groups done and shown that it works as well as my
patches. There are so many subtle things with time based approach.

[..]
> > >> > Is there a new version of dm-ioband now where you have solved the issue of
> > >> > sync/async dispatch with-in group? Before meeting at mini-summit, I am
> > >> > trying to run some tests and come up with numbers so that we have more
> > >> > clear picture of pros/cons.
> > >>
> > >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> > >> dm-ioband handles sync/async IO requests separately and
> > >> the write-starve-read issue you pointed out is fixed. I would
> > >> appreciate it if you would try them.
> > >> http://sourceforge.net/projects/ioband/files/
> > >
> > > Cool. Will get to testing it.
>
> Thanks for your help in advance.

Against what kernel version above patches apply. The biocgroup patches I tried
against 2.6.31 as well as 2.6.32-rc1 and it does not apply cleanly against any
of these?

So for the time being I am doing testing with biocgroup patches.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread
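Editorial sketch: the weight split Ryo reports for his second test (group1 at weight 20, group2 at weight 10, quoted earlier in this message) can be checked with a little arithmetic. This is only a rough illustration; `share_ratio` is a hypothetical helper name introduced here, not part of dm-ioband or fio, and the figures are the ones from the quoted ioband row.

```shell
#!/bin/sh
# share_ratio is a hypothetical helper (not part of dm-ioband or fio).
# It prints bandwidth_a / bandwidth_b rounded to two decimals.
share_ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

# Figures from the quoted ioband row, in KiB/s.
group1=27513
group2=$((3524 + 10248))    # BE7 readers plus the BE0 reader in group2

echo "group2 total:   ${group2} KiB/s"
echo "measured ratio: $(share_ratio "$group1" "$group2")"
echo "weight ratio:   $(share_ratio 20 10)"
```

If the measured ratio comes out close to the weight ratio, the group-level split matches the configured weights for that sequential-read workload; it says nothing about how ioprio is honoured inside a group, which is the point under dispute in this thread.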
* Re: IO scheduler based IO controller V10 2009-10-06 11:22 ` Vivek Goyal @ 2009-10-07 14:38 ` Ryo Tsuruta -1 siblings, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-10-07 14:38 UTC (permalink / raw) To: vgoyal Cc: nauman, m-ikeda, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya Hi Vivek, Vivek Goyal <vgoyal@redhat.com> wrote: > > > >> If one would like to > > > >> combine some physical disks into one logical device like a dm-linear, > > > >> I think one should map the IO controller on each physical device and > > > >> combine them into one logical device. > > > >> > > > > > > > > In fact this sounds like a more complicated step where one has to setup > > > > one dm-ioband device on top of each physical device. But I am assuming > > > > that this will go away once you move to per reuqest queue like implementation. > > > > I don't understand why the per request queue implementation makes it > > go away. If dm-ioband is integrated into the LVM tools, it could allow > > users to skip the complicated steps to configure dm-linear devices. > > > > Those who are not using dm-tools will be forced to use dm-tools for > bandwidth control features. If once dm-ioband is integrated into the LVM tools and bandwidth can be assigned per device by lvcreate, the use of dm-tools is no longer required for users. > Interesting. In all the test cases you always test with sequential > readers. I have changed the test case a bit (I have already reported the > results in another mail, now running the same test again with dm-version > 1.14). I made all the readers doing direct IO and in other group I put > a buffered writer. So setup looks as follows. > > In group1, I launch 1 prio 0 reader and increasing number of prio4 > readers. In group 2 I just run a dd doing buffered writes. 
Weights of > both the groups are 100 each. > > Following are the results on 2.6.31 kernel. > > With-dm-ioband > ============== > <------------prio4 readers----------------------> <---prio0 reader------> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > 1 9992KiB/s 9992KiB/s 9992KiB/s 413K usec 4621KiB/s 369K usec > 2 4859KiB/s 4265KiB/s 9122KiB/s 344K usec 4915KiB/s 401K usec > 4 2238KiB/s 1381KiB/s 7703KiB/s 532K usec 3195KiB/s 546K usec > 8 504KiB/s 46KiB/s 1439KiB/s 399K usec 7661KiB/s 220K usec > 16 131KiB/s 26KiB/s 638KiB/s 492K usec 4847KiB/s 359K usec > > With vanilla CFQ > ================ > <------------prio4 readers----------------------> <---prio0 reader------> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > 1 10779KiB/s 10779KiB/s 10779KiB/s 407K usec 16094KiB/s 808K usec > 2 7045KiB/s 6913KiB/s 13959KiB/s 538K usec 18794KiB/s 761K usec > 4 7842KiB/s 4409KiB/s 20967KiB/s 876K usec 12543KiB/s 443K usec > 8 6198KiB/s 2426KiB/s 24219KiB/s 1469K usec 9483KiB/s 685K usec > 16 5041KiB/s 1358KiB/s 27022KiB/s 2417K usec 6211KiB/s 1025K usec > > > Above results are showing how bandwidth got distributed between prio4 and > prio1 readers with-in group as we increased number of prio4 readers in > the group. In another group a buffered writer is continuously going on > as competitor. > > Notice, with dm-ioband how bandwidth allocation is broken. > > With 1 prio4 reader, prio4 reader got more bandwidth than prio1 reader. > > With 2 prio4 readers, looks like prio4 got almost same BW as prio1. > > With 8 and 16 prio4 readers, looks like prio0 readers takes over and prio4 > readers starve. > > As we incresae number of prio4 readers in the group, their total aggregate > BW share should increase. Instread it is decreasing. > > So to me in the face of competition with a writer in other group, BW is > all over the place. 
Some of these might be dm-ioband bugs and some of
> these might be coming from the fact that buffering takes place in higher
> layer and dispatch is FIFO?

Thank you for testing. I did the same test and here are the results.

with vanilla CFQ
    <------------prio4 readers------------------>    prio0        group2
nr  maxbw        minbw        aggrbw       maxlat     aggrbw       bufwrite
 1  12,140KiB/s  12,140KiB/s  12,140KiB/s  30001msec  11,125KiB/s   1,923KiB/s
 2   3,967KiB/s   3,930KiB/s   7,897KiB/s  30001msec  14,213KiB/s   1,586KiB/s
 4   3,399KiB/s   3,066KiB/s  13,031KiB/s  30082msec   8,930KiB/s   1,296KiB/s
 8   2,086KiB/s   1,720KiB/s  15,266KiB/s  30003msec   7,546KiB/s     517KiB/s
16   1,156KiB/s     837KiB/s  15,377KiB/s  30033msec   4,282KiB/s     600KiB/s

with dm-ioband weight-iosize policy
    <------------prio4 readers------------------>    prio0        group2
nr  maxbw        minbw        aggrbw       maxlat     aggrbw       bufwrite
 1     107KiB/s     107KiB/s     107KiB/s  30007msec  12,242KiB/s  12,320KiB/s
 2   1,259KiB/s     702KiB/s   1,961KiB/s  30037msec   9,657KiB/s  11,657KiB/s
 4   2,705KiB/s      29KiB/s   5,186KiB/s  30026msec   5,927KiB/s  11,300KiB/s
 8   2,428KiB/s      27KiB/s   5,629KiB/s  30054msec   5,057KiB/s  10,704KiB/s
16   2,465KiB/s      23KiB/s   4,309KiB/s  30032msec   4,750KiB/s   9,088KiB/s

The results are somewhat different from yours. The bandwidth is
distributed to each group equally, but CFQ priority is broken as you
said. I think that the reason is not FIFO dispatch, but that some IO
requests are issued from dm-ioband's kernel thread on behalf of the
processes which originate the IO requests, so CFQ assumes that the
kernel thread is the originator and uses its io_context.

> > Here is my test script.
> > -------------------------------------------------------------------------
> > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> >      --group_reporting"
> >
> > sync
> > echo 3 > /proc/sys/vm/drop_caches
> >
> > echo $$ > /cgroup/1/tasks
> > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> > echo $$ > /cgroup/2/tasks
> > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> > echo $$ > /cgroup/tasks
> > wait
> > -------------------------------------------------------------------------
> >
> > Be that as it may, I think that if every bio can point the iocontext
> > of the process, then it makes it possible to handle IO priority in the
> > higher level controller. A patchset has already been posted by Takahashi-san.
> > What do you think about this idea?
> >
> >   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> >   Subject [RFC][PATCH 1/10] I/O context inheritance
> >   From Hirokazu Takahashi <>
> >   http://lkml.org/lkml/2008/4/22/195
>
> So far you have been denying that there are issues with ioprio with-in
> group in higher level controller. Here you seems to be saying that there are
> issues with ioprio and we need to take this patch in to solve the issue? I am
> confused?

The true intention of this patch is to preserve the io-context of the
process which originates the IO, but I think that we could also make use
of this patch as one of the ways to solve this issue.

> Anyway, if you think that above patch is needed to solve the issue of
> ioprio in higher level controller, why are you not posting it as part of
> your patch series regularly, so that we can also apply this patch along
> with other patches and test the effects?

I will post the patch, but I would like to find out and understand the
reason for the above test results before posting the patch.

> Against what kernel version above patches apply. The biocgroup patches
> I tried against 2.6.31 as well as 2.6.32-rc1 and it does not apply cleanly
> against any of these?
>
> So for the time being I am doing testing with biocgroup patches.

I created those patches against 2.6.32-rc1 and made sure the patches
can be cleanly applied to that version.

Thanks,
Ryo Tsuruta

^ permalink raw reply	[flat|nested] 349+ messages in thread
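Editorial sketch: Ryo's statement that dm-ioband distributes bandwidth to each group equally can be checked against his own dm-ioband table by adding up group1 (prio4 aggregate plus prio0) and comparing it with group2's buffered writer. The `group_totals` helper is a hypothetical name used only for this illustration; the input rows are copied from the "dm-ioband weight-iosize policy" table in this message, in KiB/s.

```shell
#!/bin/sh
# group_totals is a hypothetical helper for this illustration only.
# Input columns:  nr  prio4_aggrbw  prio0_aggrbw  group2_bufwrite  (KiB/s)
# Output columns: nr  group1_total  group2_total  group1/group2
group_totals() {
    awk '{ g1 = $2 + $3; printf "%d %d %d %.2f\n", $1, g1, $4, g1 / $4 }'
}

# Rows copied from the dm-ioband weight-iosize table above.
group_totals <<'EOF'
1 107 12242 12320
2 1961 9657 11657
4 5186 5927 11300
8 5629 5057 10704
16 4309 4750 9088
EOF
```

With both groups at weight 100, a ratio near 1.00 in the last column is what an even group-level split looks like. The same sums hide the within-group problem Vivek raises: in the first row nearly all of group1's bandwidth goes to the prio0 reader and the prio4 reader gets almost nothing.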
* Re: IO scheduler based IO controller V10 2009-10-07 14:38 ` Ryo Tsuruta @ 2009-10-07 15:09 ` Vivek Goyal -1 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-07 15:09 UTC (permalink / raw) To: Ryo Tsuruta Cc: nauman, m-ikeda, linux-kernel, jens.axboe, containers, dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, agk, akpm, peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya On Wed, Oct 07, 2009 at 11:38:05PM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > Vivek Goyal <vgoyal@redhat.com> wrote: > > > > >> If one would like to > > > > >> combine some physical disks into one logical device like a dm-linear, > > > > >> I think one should map the IO controller on each physical device and > > > > >> combine them into one logical device. > > > > >> > > > > > > > > > > In fact this sounds like a more complicated step where one has to setup > > > > > one dm-ioband device on top of each physical device. But I am assuming > > > > > that this will go away once you move to per reuqest queue like implementation. > > > > > > I don't understand why the per request queue implementation makes it > > > go away. If dm-ioband is integrated into the LVM tools, it could allow > > > users to skip the complicated steps to configure dm-linear devices. > > > > > > > Those who are not using dm-tools will be forced to use dm-tools for > > bandwidth control features. > > If once dm-ioband is integrated into the LVM tools and bandwidth can > be assigned per device by lvcreate, the use of dm-tools is no longer > required for users. But it is same thing. Now LVM tools is mandatory to use? > > > Interesting. In all the test cases you always test with sequential > > readers. I have changed the test case a bit (I have already reported the > > results in another mail, now running the same test again with dm-version > > 1.14). 
I made all the readers doing direct IO and in other group I put > > a buffered writer. So setup looks as follows. > > > > In group1, I launch 1 prio 0 reader and increasing number of prio4 > > readers. In group 2 I just run a dd doing buffered writes. Weights of > > both the groups are 100 each. > > > > Following are the results on 2.6.31 kernel. > > > > With-dm-ioband > > ============== > > <------------prio4 readers----------------------> <---prio0 reader------> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > > 1 9992KiB/s 9992KiB/s 9992KiB/s 413K usec 4621KiB/s 369K usec > > 2 4859KiB/s 4265KiB/s 9122KiB/s 344K usec 4915KiB/s 401K usec > > 4 2238KiB/s 1381KiB/s 7703KiB/s 532K usec 3195KiB/s 546K usec > > 8 504KiB/s 46KiB/s 1439KiB/s 399K usec 7661KiB/s 220K usec > > 16 131KiB/s 26KiB/s 638KiB/s 492K usec 4847KiB/s 359K usec > > > > With vanilla CFQ > > ================ > > <------------prio4 readers----------------------> <---prio0 reader------> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > > 1 10779KiB/s 10779KiB/s 10779KiB/s 407K usec 16094KiB/s 808K usec > > 2 7045KiB/s 6913KiB/s 13959KiB/s 538K usec 18794KiB/s 761K usec > > 4 7842KiB/s 4409KiB/s 20967KiB/s 876K usec 12543KiB/s 443K usec > > 8 6198KiB/s 2426KiB/s 24219KiB/s 1469K usec 9483KiB/s 685K usec > > 16 5041KiB/s 1358KiB/s 27022KiB/s 2417K usec 6211KiB/s 1025K usec > > > > > > Above results are showing how bandwidth got distributed between prio4 and > > prio1 readers with-in group as we increased number of prio4 readers in > > the group. In another group a buffered writer is continuously going on > > as competitor. > > > > Notice, with dm-ioband how bandwidth allocation is broken. > > > > With 1 prio4 reader, prio4 reader got more bandwidth than prio1 reader. > > > > With 2 prio4 readers, looks like prio4 got almost same BW as prio1. > > > > With 8 and 16 prio4 readers, looks like prio0 readers takes over and prio4 > > readers starve. 
> > > > As we incresae number of prio4 readers in the group, their total aggregate > > BW share should increase. Instread it is decreasing. > > > > So to me in the face of competition with a writer in other group, BW is > > all over the place. Some of these might be dm-ioband bugs and some of > > these might be coming from the fact that buffering takes place in higher > > layer and dispatch is FIFO? > > Thank you for testing. I did the same test and here are the results. > > with vanilla CFQ > <------------prio4 readers------------------> prio0 group2 > maxbw minbw aggrbw maxlat aggrbw bufwrite > 1 12,140KiB/s 12,140KiB/s 12,140KiB/s 30001msec 11,125KiB/s 1,923KiB/s > 2 3,967KiB/s 3,930KiB/s 7,897KiB/s 30001msec 14,213KiB/s 1,586KiB/s > 4 3,399KiB/s 3,066KiB/s 13,031KiB/s 30082msec 8,930KiB/s 1,296KiB/s > 8 2,086KiB/s 1,720KiB/s 15,266KiB/s 30003msec 7,546KiB/s 517KiB/s > 16 1,156KiB/s 837KiB/s 15,377KiB/s 30033msec 4,282KiB/s 600KiB/s > > with dm-ioband weight-iosize policy > <------------prio4 readers------------------> prio0 group2 > maxbw minbw aggrbw maxlat aggrbw bufwrite > 1 107KiB/s 107KiB/s 107KiB/s 30007msec 12,242KiB/s 12,320KiB/s > 2 1,259KiB/s 702KiB/s 1,961KiB/s 30037msec 9,657KiB/s 11,657KiB/s > 4 2,705KiB/s 29KiB/s 5,186KiB/s 30026msec 5,927KiB/s 11,300KiB/s > 8 2,428KiB/s 27KiB/s 5,629KiB/s 30054msec 5,057KiB/s 10,704KiB/s > 16 2,465KiB/s 23KiB/s 4,309KiB/s 30032msec 4,750KiB/s 9,088KiB/s > > The results are somewhat different from yours. The bandwidth is > distributed to each group equally, but CFQ priority is broken as you > said. I think that the reason is not because of FIFO, but because > some IO requests are issued from dm-ioband's kernel thread on behalf of > processes which origirante the IO requests, then CFQ assumes that the > kernel thread is the originator and uses its io_context. Ok. Our numbers can vary a bit depending on fio settings like block size and underlying storage also. But that's not the important thing. 
Currently with this test I just wanted to point out that the model of ioprio
with-in group is currently broken with dm-ioband, and good that you can
reproduce that.

One minor nit: for max latency you need to look at the "clat" row and the
"max=" field in fio output. Most of the time "max latency" will matter most.
You seem to be currently grepping for "maxt", which just seems to be telling
how long the test ran, in this case 30 seconds.

Assigning reads to the right context in CFQ and not to the dm-ioband thread
might help a bit, but I am a bit skeptical and following is the reason.

CFQ relies on time, providing a longer time slice for a higher priority
process, and if one does not use its time slice, it loses its share. So the
moment you buffer even a single bio of a process in dm-layer, if CFQ was
servicing that process at the same time, that process will lose its share. CFQ
will at max anticipate for 8 ms, and if buffering is longer than 8 ms, CFQ will
expire the queue and move on to the next queue. Later if you submit the same
bio with the dm-ioband helper thread, and even if CFQ attributes it to the
right process, it is not going to help much, as the process already lost its
slice and now a new slice will start.

>
> > > Here is my test script.
> > > -------------------------------------------------------------------------
> > > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> > >      --group_reporting"
> > >
> > > sync
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > echo $$ > /cgroup/1/tasks
> > > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> > > echo $$ > /cgroup/2/tasks
> > > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> > > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> > > echo $$ > /cgroup/tasks
> > > wait
> > > -------------------------------------------------------------------------
> > >
> > > Be that as it may, I think that if every bio can point the iocontext
> > > of the process, then it makes it possible to handle IO priority in the
> > > higher level controller. A patchset has already been posted by Takahashi-san.
> > > What do you think about this idea?
> > >
> > >   Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > >   Subject [RFC][PATCH 1/10] I/O context inheritance
> > >   From Hirokazu Takahashi <>
> > >   http://lkml.org/lkml/2008/4/22/195
> >
> > So far you have been denying that there are issues with ioprio with-in
> > group in higher level controller. Here you seems to be saying that there are
> > issues with ioprio and we need to take this patch in to solve the issue? I am
> > confused?
>
> The true intention of this patch is to preserve the io-context of a
> process which originate it, but I think that we could also make use of
> this patch for one of the way to solve this issue.
>

Ok. Did you run the same test with this patch applied, and how do the numbers
look? Can you please forward port it to 2.6.31? I would also like to
play with it.

I am running more tests/numbers with 2.6.31 for all the IO controllers and
planning to post them to lkml before we meet for the IO mini summit. Numbers
can help us understand the issue better.
In the first phase I am planning to post numbers for the IO scheduler
controller and dm-ioband. Then will get to the max bw controller of Andrea
Righi.

> > Anyway, if you think that above patch is needed to solve the issue of
> > ioprio in higher level controller, why are you not posting it as part of
> > your patch series regularly, so that we can also apply this patch along
> > with other patches and test the effects?
>
> I will post the patch, but I would like to find out and understand the
> reason of above test results before posting the patch.
>

Ok. So in the mean time, I will continue to do testing with dm-ioband
version 1.14.0 and post the numbers.

> > Against what kernel version above patches apply. The biocgroup patches
> > I tried against 2.6.31 as well as 2.6.32-rc1 and it does not apply cleanly
> > against any of these?
> >
> > So for the time being I am doing testing with biocgroup patches.
>
> I created those patches against 2.6.32-rc1 and made sure the patches
> can be cleanly applied to that version.

I am applying the dm-ioband patch first and then the bio cgroup patches. Is
this the right order? Will try again.

Anyway, I don't have too much time before the IO mini summit, so will stick to
2.6.31 for the time being. If time permits, will venture into 32-rc1 also.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 349+ messages in thread
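Editorial sketch: Vivek's nit above about reading maximum latency from the "clat" row rather than from "maxt" can be illustrated with two small extraction helpers. Everything here is an assumption made for illustration: `fio_max_clat` and `fio_maxt` are hypothetical names, and the sample log lines are made up and only approximate the layout of fio's per-job and run-status output, so the patterns may need adjusting for a given fio version.

```shell
#!/bin/sh
# Hypothetical helpers, not part of fio.

# Print the max= field of the first "clat" row: the worst completion latency.
fio_max_clat() {
    sed -n 's/^ *clat *([^)]*):.*max= *\([^,]*\),.*/\1/p' "$1" | head -n 1
}

# Print the first maxt= field: the longest job runtime, not a latency.
fio_maxt() {
    sed -n 's/.*maxt= *\([^, ]*\).*/\1/p' "$1" | head -n 1
}

# Made-up sample lines that only approximate fio's output layout.
log=$(mktemp)
cat > "$log" <<'EOF'
    clat (usec): min=120, max=413K, avg=5021.33, stdev=880.10
   READ: io=300MB, aggrb=10076KB/s, mint=30001msec, maxt=30001msec
EOF

echo "max clat: $(fio_max_clat "$log") (unit is given in the clat row header)"
echo "maxt:     $(fio_maxt "$log")"
rm -f "$log"
```

The point of keeping the two helpers side by side is that a "maxlat" column filled from maxt will sit at roughly the configured runtime (about 30 seconds with --runtime=30) regardless of how the scheduler behaves, which is consistent with the 30001msec to 30082msec values in Ryo's tables.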
* Re: IO scheduler based IO controller V10
@ 2009-10-07 15:09 ` Vivek Goyal
From: Vivek Goyal @ 2009-10-07 15:09 UTC
To: Ryo Tsuruta
Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir,
    paolo.valente, jmarchan, guijianfeng, fernando, mikew,
    yoshikawa.takuya, jmoyer, nauman, mingo, righi.andrea, riel, lizf,
    fchecconi, s-uchida, containers, linux-kernel, akpm, m-ikeda, torvalds

On Wed, Oct 07, 2009 at 11:38:05PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > >> If one would like to
> > > > >> combine some physical disks into one logical device like a dm-linear,
> > > > >> I think one should map the IO controller on each physical device and
> > > > >> combine them into one logical device.
> > > > >>
> > > > >
> > > > > In fact this sounds like a more complicated step where one has to setup
> > > > > one dm-ioband device on top of each physical device. But I am assuming
> > > > > that this will go away once you move to per request queue like implementation.
> > >
> > > I don't understand why the per request queue implementation makes it
> > > go away. If dm-ioband is integrated into the LVM tools, it could allow
> > > users to skip the complicated steps to configure dm-linear devices.
> > >
> >
> > Those who are not using dm-tools will be forced to use dm-tools for
> > bandwidth control features.
>
> If once dm-ioband is integrated into the LVM tools and bandwidth can
> be assigned per device by lvcreate, the use of dm-tools is no longer
> required for users.

But it is the same thing. Now LVM tools are mandatory to use?

>
> > Interesting. In all the test cases you always test with sequential
> > readers. I have changed the test case a bit (I have already reported the
> > results in another mail, now running the same test again with dm-version
> > 1.14). I made all the readers doing direct IO and in other group I put
> > a buffered writer. So setup looks as follows.
> >
> > In group1, I launch 1 prio 0 reader and increasing number of prio4
> > readers. In group 2 I just run a dd doing buffered writes. Weights of
> > both the groups are 100 each.
> >
> > Following are the results on 2.6.31 kernel.
> >
> > With-dm-ioband
> > ==============
> >     <------------prio4 readers---------------------->  <---prio0 reader------>
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> > 1   9992KiB/s    9992KiB/s    9992KiB/s    413K usec    4621KiB/s    369K usec
> > 2   4859KiB/s    4265KiB/s    9122KiB/s    344K usec    4915KiB/s    401K usec
> > 4   2238KiB/s    1381KiB/s    7703KiB/s    532K usec    3195KiB/s    546K usec
> > 8   504KiB/s     46KiB/s      1439KiB/s    399K usec    7661KiB/s    220K usec
> > 16  131KiB/s     26KiB/s      638KiB/s     492K usec    4847KiB/s    359K usec
> >
> > With vanilla CFQ
> > ================
> >     <------------prio4 readers---------------------->  <---prio0 reader------>
> > nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> > 1   10779KiB/s   10779KiB/s   10779KiB/s   407K usec    16094KiB/s   808K usec
> > 2   7045KiB/s    6913KiB/s    13959KiB/s   538K usec    18794KiB/s   761K usec
> > 4   7842KiB/s    4409KiB/s    20967KiB/s   876K usec    12543KiB/s   443K usec
> > 8   6198KiB/s    2426KiB/s    24219KiB/s   1469K usec   9483KiB/s    685K usec
> > 16  5041KiB/s    1358KiB/s    27022KiB/s   2417K usec   6211KiB/s    1025K usec
> >
> >
> > Above results are showing how bandwidth got distributed between prio4 and
> > prio0 readers with-in group as we increased number of prio4 readers in
> > the group. In another group a buffered writer is continuously going on
> > as competitor.
> >
> > Notice, with dm-ioband how bandwidth allocation is broken.
> >
> > With 1 prio4 reader, prio4 reader got more bandwidth than prio0 reader.
> >
> > With 2 prio4 readers, looks like prio4 got almost same BW as prio0.
> >
> > With 8 and 16 prio4 readers, looks like prio0 reader takes over and prio4
> > readers starve.
> >
> > As we increase number of prio4 readers in the group, their total aggregate
> > BW share should increase. Instead it is decreasing.
> >
> > So to me in the face of competition with a writer in other group, BW is
> > all over the place. Some of these might be dm-ioband bugs and some of
> > these might be coming from the fact that buffering takes place in higher
> > layer and dispatch is FIFO?
>
> Thank you for testing. I did the same test and here are the results.
>
> with vanilla CFQ
>     <------------prio4 readers------------------>  prio0        group2
>     maxbw        minbw        aggrbw       maxlat     aggrbw       bufwrite
>  1  12,140KiB/s  12,140KiB/s  12,140KiB/s  30001msec  11,125KiB/s   1,923KiB/s
>  2   3,967KiB/s   3,930KiB/s   7,897KiB/s  30001msec  14,213KiB/s   1,586KiB/s
>  4   3,399KiB/s   3,066KiB/s  13,031KiB/s  30082msec   8,930KiB/s   1,296KiB/s
>  8   2,086KiB/s   1,720KiB/s  15,266KiB/s  30003msec   7,546KiB/s     517KiB/s
> 16   1,156KiB/s     837KiB/s  15,377KiB/s  30033msec   4,282KiB/s     600KiB/s
>
> with dm-ioband weight-iosize policy
>     <------------prio4 readers------------------>  prio0        group2
>     maxbw        minbw        aggrbw       maxlat     aggrbw       bufwrite
>  1     107KiB/s     107KiB/s     107KiB/s  30007msec  12,242KiB/s  12,320KiB/s
>  2   1,259KiB/s     702KiB/s   1,961KiB/s  30037msec   9,657KiB/s  11,657KiB/s
>  4   2,705KiB/s      29KiB/s   5,186KiB/s  30026msec   5,927KiB/s  11,300KiB/s
>  8   2,428KiB/s      27KiB/s   5,629KiB/s  30054msec   5,057KiB/s  10,704KiB/s
> 16   2,465KiB/s      23KiB/s   4,309KiB/s  30032msec   4,750KiB/s   9,088KiB/s
>
> The results are somewhat different from yours. The bandwidth is
> distributed to each group equally, but CFQ priority is broken as you
> said. I think that the reason is not because of FIFO, but because
> some IO requests are issued from dm-ioband's kernel thread on behalf of
> processes which originate the IO requests, then CFQ assumes that the
> kernel thread is the originator and uses its io_context.

Ok. Our numbers can vary a bit depending on fio settings like block size
and underlying storage also. But that's not the important thing. With this
test I just wanted to point out that the model of ioprio with-in group is
currently broken with dm-ioband, and it is good that you can reproduce
that.

One minor nit: for max latency you need to look at the "clat " row and the
"max=" field in fio output. Most of the time "max latency" will matter
most. You seem to be currently grepping for "maxt", which just seems to
tell how long the test ran, in this case 30 seconds.

Assigning reads to the right context in CFQ and not to the dm-ioband
thread might help a bit, but I am a bit skeptical, and the following is
the reason.

CFQ relies on time, providing a longer time slice for a higher priority
process, and if a process does not use its time slice, it loses its share.
So the moment you buffer even a single bio of a process in the dm layer,
if CFQ was servicing that process at the same time, that process will lose
its share. CFQ will at most anticipate for 8 ms, and if buffering is
longer than 8 ms, CFQ will expire the queue and move on to the next queue.
Later, if you submit the same bio with the dm-ioband helper thread, even
if CFQ attributes it to the right process, it is not going to help much,
as the process has already lost its slice and now a new slice will start.

>
> > > Here is my test script.
> > > -------------------------------------------------------------------------
> > > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
> > >      --group_reporting"
> > >
> > > sync
> > > echo 3 > /proc/sys/vm/drop_caches
> > >
> > > echo $$ > /cgroup/1/tasks
> > > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> > > echo $$ > /cgroup/2/tasks
> > > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> > > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> > > echo $$ > /cgroup/tasks
> > > wait
> > > -------------------------------------------------------------------------
> > >
> > > Be that as it may, I think that if every bio can point the iocontext
> > > of the process, then it makes it possible to handle IO priority in the
> > > higher level controller. A patchset has already been posted by
> > > Takahashi-san. What do you think about this idea?
> > >
> > > Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > > Subject [RFC][PATCH 1/10] I/O context inheritance
> > > From Hirokazu Takahashi <>
> > > http://lkml.org/lkml/2008/4/22/195
> >
> > So far you have been denying that there are issues with ioprio with-in
> > group in higher level controller. Here you seem to be saying that there are
> > issues with ioprio and we need to take this patch in to solve the issue? I am
> > confused?
>
> The true intention of this patch is to preserve the io-context of a
> process which originates it, but I think that we could also make use of
> this patch for one of the ways to solve this issue.
>

Ok. Did you run the same test with this patch applied, and how do the
numbers look? Can you please forward port it to 2.6.31? I would also like
to play with it.

I am running more tests/numbers with 2.6.31 for all the IO controllers and
planning to post them to lkml before we meet for the IO mini summit.
Numbers can help us understand the issue better.

In the first phase I am planning to post numbers for the IO scheduler
controller and dm-ioband. Then I will get to the max bw controller of
Andrea Righi.

> > Anyway, if you think that above patch is needed to solve the issue of
> > ioprio in higher level controller, why are you not posting it as part of
> > your patch series regularly, so that we can also apply this patch along
> > with other patches and test the effects?
>
> I will post the patch, but I would like to find out and understand the
> reason of above test results before posting the patch.
>

Ok. So in the meantime, I will continue to do testing with dm-ioband
version 1.14.0 and post the numbers.

> > Against what kernel version above patches apply. The biocgroup patches
> > I tried against 2.6.31 as well as 2.6.32-rc1 and it does not apply cleanly
> > against any of these?
> >
> > So for the time being I am doing testing with biocgroup patches.
>
> I created those patches against 2.6.32-rc1 and made sure the patches
> can be cleanly applied to that version.

I am applying the dm-ioband patch first and then the bio cgroup patches.
Is this the right order? Will try again.

Anyway, I don't have too much time before the IO mini summit, so I will
stick to 2.6.31 for the time being. If time permits, I will venture into
32-rc1 also.

Thanks
Vivek
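
The point Vivek makes about the aggregate prio4 share (it should grow as prio4 readers are added, but shrinks under dm-ioband) can be checked against the aggregate-bandwidth columns of his two tables. The Python sketch below copies those columns as posted; the function and variable names are illustrative, and defining "share" as prio4 / (prio4 + prio0) is an assumption made here for the sake of the example, not a metric either author states.

```python
# Aggregate bandwidth in KiB/s, copied from the two tables in Vivek's mail
# (2.6.31 kernel; group1 holds 1 prio0 reader plus N prio4 readers).
NR_PRIO4 = [1, 2, 4, 8, 16]

DM_IOBAND = {
    "prio4": [9992, 9122, 7703, 1439, 638],
    "prio0": [4621, 4915, 3195, 7661, 4847],
}
VANILLA_CFQ = {
    "prio4": [10779, 13959, 20967, 24219, 27022],
    "prio0": [16094, 18794, 12543, 9483, 6211],
}


def prio4_share(table):
    """Fraction of group1's read bandwidth that went to the prio4 readers.

    "Share" is taken to be prio4 / (prio4 + prio0). That definition is an
    assumption for illustration; it does not appear in the thread.
    """
    return [p4 / (p4 + p0) for p4, p0 in zip(table["prio4"], table["prio0"])]


def is_strictly_increasing(values):
    """True if every value is larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    for name, table in (("dm-ioband", DM_IOBAND), ("vanilla CFQ", VANILLA_CFQ)):
        shares = prio4_share(table)
        print(name)
        for nr, share in zip(NR_PRIO4, shares):
            print(f"  {nr:2d} prio4 readers: {share:6.1%} of group1 reads")
        print("  share grows with reader count:", is_strictly_increasing(shares))
```

Under this definition the vanilla CFQ share rises at every step (roughly 40% with one prio4 reader, roughly 81% with sixteen), while the dm-ioband share ends far below where it started (roughly 68% down to roughly 12%), which matches the behaviour described in the mail.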
* Re: IO scheduler based IO controller V10
  2009-10-07 15:09 ` Vivek Goyal
@ 2009-10-08  2:18 ` Ryo Tsuruta
From: Ryo Tsuruta @ 2009-10-08 2:18 UTC
To: vgoyal
Cc: nauman, m-ikeda, linux-kernel, jens.axboe, containers, dm-devel,
    dpshah, lizf, mikew, fchecconi, paolo.valente, fernando, s-uchida,
    taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea, agk, akpm,
    peterz, jmarchan, torvalds, mingo, riel, yoshikawa.takuya

Hi Vivek,

Vivek Goyal <vgoyal@redhat.com> wrote:
> Ok. Our numbers can vary a bit depending on fio settings like block size
> and underlying storage also. But that's not the important thing. Currently
> with this test I just wanted to point out that model of ioprio with-in group
> is currently broken with dm-ioband and good that you can reproduce that.
>
> One minor nit, for max latency you need to look at "clat " row and "max=" field
> in fio output. Most of the time "max latency" will matter most. You seem to
> be currently grepping for "maxt" which just seems to be telling how
> long did test run and in this case 30 seconds.
>
> Assigning reads to right context in CFQ and not to dm-ioband thread might
> help a bit, but I am bit skeptical and following is the reason.
>
> CFQ relies on time providing longer time slice length for higher priority
> process and if one does not use time slice, it loses its share. So the moment
> you buffer even single bio of a process in dm-layer, if CFQ was servicing that
> process at same time, that process will lose its share. CFQ will at max
> anticipate for 8 ms and if buffering is longer than 8ms, CFQ will expire the
> queue and move on to next queue. Later if you submit same bio and with
> dm-ioband helper thread and even if CFQ attributes it to right process, it is
> not going to help much as process already lost its slice and now a new slice
> will start.

O.K. I would like to figure something out for this issue.

> > > > Be that as it may, I think that if every bio can point the iocontext
> > > > of the process, then it makes it possible to handle IO priority in the
> > > > higher level controller. A patchset has already been posted by
> > > > Takahashi-san. What do you think about this idea?
> > > >
> > > > Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> > > > Subject [RFC][PATCH 1/10] I/O context inheritance
> > > > From Hirokazu Takahashi <>
> > > > http://lkml.org/lkml/2008/4/22/195
> > >
> > > So far you have been denying that there are issues with ioprio with-in
> > > group in higher level controller. Here you seem to be saying that there are
> > > issues with ioprio and we need to take this patch in to solve the issue? I am
> > > confused?
> >
> > The true intention of this patch is to preserve the io-context of a
> > process which originates it, but I think that we could also make use of
> > this patch for one of the ways to solve this issue.
> >
>
> Ok. Did you run the same test with this patch applied and how do numbers look
> like? Can you please forward port it to 2.6.31 and I will also like to
> play with it?

I'm sorry, I have no time to do that this week. I would like to do the
forward porting and test with it by the mini-summit when possible.

> I am running more tests/numbers with 2.6.31 for all the IO controllers and
> planning to post it to lkml before we meet for IO mini summit. Numbers can
> help us understand the issue better.
>
> In first phase I am planning to post numbers for IO scheduler controller
> and dm-ioband. Then will get to max bw controller of Andrea Righi.

That sounds good. Thank you for your work.

> > I created those patches against 2.6.32-rc1 and made sure the patches
> > can be cleanly applied to that version.
>
> I am applying dm-ioband patch first and then bio cgroup patches. Is this
> right order? Will try again.

Yes, the order is right. Here are the sha1sums.

9f4e50878d77922c84a29be9913a8b5c3f66e6ec  linux-2.6.32-rc1.tar.bz2
15d7cc9d801805327204296a2454d6c5346dd2ae  dm-ioband-1.14.0.patch
5e0626c14a40c319fb79f2f78378d2de5cc97b02  blkio-cgroup-v13.tar.bz2

Thanks,
Ryo Tsuruta
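
Before retrying the patch order, it may be worth confirming that the downloaded files match the sums Ryo posted. The following is a minimal Python sketch of that check, assuming the three files sit in one directory; `sha1_of` and `check_downloads` are hypothetical helper names, and the same result can be had from `sha1sum -c` with the three lines above saved to a file.

```python
import hashlib
from pathlib import Path

# SHA-1 sums exactly as posted by Ryo in the mail above.
EXPECTED_SHA1 = {
    "linux-2.6.32-rc1.tar.bz2": "9f4e50878d77922c84a29be9913a8b5c3f66e6ec",
    "dm-ioband-1.14.0.patch": "15d7cc9d801805327204296a2454d6c5346dd2ae",
    "blkio-cgroup-v13.tar.bz2": "5e0626c14a40c319fb79f2f78378d2de5cc97b02",
}


def sha1_of(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(directory, expected=EXPECTED_SHA1):
    """Map each expected file name to 'ok', 'MISMATCH', or 'missing'."""
    results = {}
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif sha1_of(path) == want:
            results[name] = "ok"
        else:
            results[name] = "MISMATCH"
    return results


if __name__ == "__main__":
    # Assumes the files were downloaded into the current directory.
    for name, status in check_downloads(".").items():
        print(f"{status:8s} {name}")
```

A "MISMATCH" on either patch file would be a simpler explanation for a failed apply than the patch order.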
* Re: IO scheduler based IO controller V10
  2009-10-07 14:38 ` Ryo Tsuruta
@ 2009-10-07 16:41 ` Rik van Riel
From: Rik van Riel @ 2009-10-07 16:41 UTC
To: Ryo Tsuruta
Cc: vgoyal, nauman, m-ikeda, linux-kernel, jens.axboe, containers,
    dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
    s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea,
    agk, akpm, peterz, jmarchan, torvalds, mingo, yoshikawa.takuya

Ryo Tsuruta wrote:

> If once dm-ioband is integrated into the LVM tools and bandwidth can
> be assigned per device by lvcreate, the use of dm-tools is no longer
> required for users.

A lot of large data center users have a SAN, with volume management
handled SAN-side and dedicated LUNs for different applications or
groups of applications.

Because of alignment issues, they typically use filesystems directly
on top of the LUNs, without partitions or LVM layers. We cannot rely
on LVM for these systems, because people prefer not to use that.

Besides ... isn't the goal of the cgroups io bandwidth controller
to control the IO used by PROCESSES?

If we want to control processes, why would we want the configuration
to be applied to any other kind of object in the system?

--
All rights reversed.
* Re: IO scheduler based IO controller V10
  2009-10-07 16:41 ` Rik van Riel
@ 2009-10-08 10:22 ` Ryo Tsuruta
From: Ryo Tsuruta @ 2009-10-08 10:22 UTC
To: riel
Cc: vgoyal, nauman, m-ikeda, linux-kernel, jens.axboe, containers,
    dm-devel, dpshah, lizf, mikew, fchecconi, paolo.valente, fernando,
    s-uchida, taka, guijianfeng, jmoyer, dhaval, balbir, righi.andrea,
    agk, akpm, peterz, jmarchan, torvalds, mingo, yoshikawa.takuya

Hi Rik,

Rik van Riel <riel@redhat.com> wrote:
> Ryo Tsuruta wrote:
>
> > If once dm-ioband is integrated into the LVM tools and bandwidth can
> > be assigned per device by lvcreate, the use of dm-tools is no longer
> > required for users.
>
> A lot of large data center users have a SAN, with volume management
> handled SAN-side and dedicated LUNs for different applications or
> groups of applications.
>
> Because of alignment issues, they typically use filesystems directly
> on top of the LUNs, without partitions or LVM layers. We cannot rely
> on LVM for these systems, because people prefer not to use that.

Thank you for your explanation. So I have a plan to reimplement
dm-ioband in the block layer so that dm-tools is no longer required.
The opinion I wrote above assumes that dm-ioband is used for a logical
volume which consists of multiple physical devices. If dm-ioband is
integrated into the LVM tools, then the use of dm-tools is not
required and the underlying physical devices can be automatically
detected and configured to use dm-ioband.

Thanks,
Ryo Tsuruta

> Besides ... isn't the goal of the cgroups io bandwidth controller
> to control the IO used by PROCESSES?
>
> If we want to control processes, why would we want the configuration
> to be applied to any other kind of object in the system?
[parent not found: <20091007.233805.183040347.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091007.233805.183040347.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> @ 2009-10-07 15:09 ` Vivek Goyal 2009-10-07 16:41 ` Rik van Riel 1 sibling, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-07 15:09 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Wed, Oct 07, 2009 at 11:38:05PM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > > >> If one would like to > > > > >> combine some physical disks into one logical device like a dm-linear, > > > > >> I think one should map the IO controller on each physical device and > > > > >> combine them into one logical device. > > > > >> > > > > > > > > > > In fact this sounds like a more complicated step where one has to setup > > > > > one dm-ioband device on top of each physical device. But I am assuming > > > > > that this will go away once you move to per reuqest queue like implementation. > > > > > > I don't understand why the per request queue implementation makes it > > > go away. If dm-ioband is integrated into the LVM tools, it could allow > > > users to skip the complicated steps to configure dm-linear devices. > > > > > > > Those who are not using dm-tools will be forced to use dm-tools for > > bandwidth control features. 
> > If once dm-ioband is integrated into the LVM tools and bandwidth can > be assigned per device by lvcreate, the use of dm-tools is no longer > required for users. But it is same thing. Now LVM tools is mandatory to use? > > > Interesting. In all the test cases you always test with sequential > > readers. I have changed the test case a bit (I have already reported the > > results in another mail, now running the same test again with dm-version > > 1.14). I made all the readers doing direct IO and in other group I put > > a buffered writer. So setup looks as follows. > > > > In group1, I launch 1 prio 0 reader and increasing number of prio4 > > readers. In group 2 I just run a dd doing buffered writes. Weights of > > both the groups are 100 each. > > > > Following are the results on 2.6.31 kernel. > > > > With-dm-ioband > > ============== > > <------------prio4 readers----------------------> <---prio0 reader------> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > > 1 9992KiB/s 9992KiB/s 9992KiB/s 413K usec 4621KiB/s 369K usec > > 2 4859KiB/s 4265KiB/s 9122KiB/s 344K usec 4915KiB/s 401K usec > > 4 2238KiB/s 1381KiB/s 7703KiB/s 532K usec 3195KiB/s 546K usec > > 8 504KiB/s 46KiB/s 1439KiB/s 399K usec 7661KiB/s 220K usec > > 16 131KiB/s 26KiB/s 638KiB/s 492K usec 4847KiB/s 359K usec > > > > With vanilla CFQ > > ================ > > <------------prio4 readers----------------------> <---prio0 reader------> > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > > 1 10779KiB/s 10779KiB/s 10779KiB/s 407K usec 16094KiB/s 808K usec > > 2 7045KiB/s 6913KiB/s 13959KiB/s 538K usec 18794KiB/s 761K usec > > 4 7842KiB/s 4409KiB/s 20967KiB/s 876K usec 12543KiB/s 443K usec > > 8 6198KiB/s 2426KiB/s 24219KiB/s 1469K usec 9483KiB/s 685K usec > > 16 5041KiB/s 1358KiB/s 27022KiB/s 2417K usec 6211KiB/s 1025K usec > > > > > > Above results are showing how bandwidth got distributed between prio4 and > > prio1 readers with-in group 
as we increased number of prio4 readers in > > the group. In another group a buffered writer is continuously going on > > as competitor. > > > > Notice, with dm-ioband how bandwidth allocation is broken. > > > > With 1 prio4 reader, prio4 reader got more bandwidth than prio1 reader. > > > > With 2 prio4 readers, looks like prio4 got almost same BW as prio1. > > > > With 8 and 16 prio4 readers, looks like prio0 readers takes over and prio4 > > readers starve. > > > > As we incresae number of prio4 readers in the group, their total aggregate > > BW share should increase. Instread it is decreasing. > > > > So to me in the face of competition with a writer in other group, BW is > > all over the place. Some of these might be dm-ioband bugs and some of > > these might be coming from the fact that buffering takes place in higher > > layer and dispatch is FIFO? > > Thank you for testing. I did the same test and here are the results. > > with vanilla CFQ > <------------prio4 readers------------------> prio0 group2 > maxbw minbw aggrbw maxlat aggrbw bufwrite > 1 12,140KiB/s 12,140KiB/s 12,140KiB/s 30001msec 11,125KiB/s 1,923KiB/s > 2 3,967KiB/s 3,930KiB/s 7,897KiB/s 30001msec 14,213KiB/s 1,586KiB/s > 4 3,399KiB/s 3,066KiB/s 13,031KiB/s 30082msec 8,930KiB/s 1,296KiB/s > 8 2,086KiB/s 1,720KiB/s 15,266KiB/s 30003msec 7,546KiB/s 517KiB/s > 16 1,156KiB/s 837KiB/s 15,377KiB/s 30033msec 4,282KiB/s 600KiB/s > > with dm-ioband weight-iosize policy > <------------prio4 readers------------------> prio0 group2 > maxbw minbw aggrbw maxlat aggrbw bufwrite > 1 107KiB/s 107KiB/s 107KiB/s 30007msec 12,242KiB/s 12,320KiB/s > 2 1,259KiB/s 702KiB/s 1,961KiB/s 30037msec 9,657KiB/s 11,657KiB/s > 4 2,705KiB/s 29KiB/s 5,186KiB/s 30026msec 5,927KiB/s 11,300KiB/s > 8 2,428KiB/s 27KiB/s 5,629KiB/s 30054msec 5,057KiB/s 10,704KiB/s > 16 2,465KiB/s 23KiB/s 4,309KiB/s 30032msec 4,750KiB/s 9,088KiB/s > > The results are somewhat different from yours. 
The bandwidth is
> distributed to each group equally, but CFQ priority is broken as you
> said. I think that the reason is not because of FIFO, but because
> some IO requests are issued from dm-ioband's kernel thread on behalf of
> processes which originate the IO requests, then CFQ assumes that the
> kernel thread is the originator and uses its io_context.

Ok. Our numbers can vary a bit depending on fio settings such as block size,
and on the underlying storage. But that's not the important thing. With this
test I just wanted to point out that the model of ioprio within a group is
currently broken with dm-ioband, and it is good that you can reproduce that.

One minor nit: for max latency you need to look at the "clat" row and the
"max=" field in the fio output. Most of the time "max latency" will matter
most. You seem to be grepping for "maxt", which just tells how long the
test ran, in this case 30 seconds.

Assigning reads to the right context in CFQ, and not to the dm-ioband thread,
might help a bit, but I am a bit skeptical, and the following is the reason.

CFQ relies on time: it provides a longer time slice to a higher priority
process, and if a process does not use its time slice, it loses its share.
So the moment you buffer even a single bio of a process in the dm layer, if
CFQ was servicing that process at the same time, that process will lose its
share. CFQ will anticipate for at most 8 ms, and if buffering takes longer
than 8 ms, CFQ will expire the queue and move on to the next queue. If you
later submit the same bio from the dm-ioband helper thread, even if CFQ
attributes it to the right process, it is not going to help much, as the
process has already lost its slice and a new slice will start.

>
> > > Here is my test script.
> > > ------------------------------------------------------------------------- > > > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \ > > > --group_reporting" > > > > > > sync > > > echo 3 > /proc/sys/vm/drop_caches > > > > > > echo $$ > /cgroup/1/tasks > > > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 & > > > echo $$ > /cgroup/2/tasks > > > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 & > > > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 & > > > echo $$ > /cgroup/tasks > > > wait > > > ------------------------------------------------------------------------- > > > > > > Be that as it way, I think that if every bio can point the iocontext > > > of the process, then it makes it possible to handle IO priority in the > > > higher level controller. A patchse has already posted by Takhashi-san. > > > What do you think about this idea? > > > > > > Date Tue, 22 Apr 2008 22:51:31 +0900 (JST) > > > Subject [RFC][PATCH 1/10] I/O context inheritance > > > From Hirokazu Takahashi <> > > > http://lkml.org/lkml/2008/4/22/195 > > > > So far you have been denying that there are issues with ioprio with-in > > group in higher level controller. Here you seems to be saying that there are > > issues with ioprio and we need to take this patch in to solve the issue? I am > > confused? > > The true intention of this patch is to preserve the io-context of a > process which originate it, but I think that we could also make use of > this patch for one of the way to solve this issue. > Ok. Did you run the same test with this patch applied and how do numbers look like? Can you please forward port it to 2.6.31 and I will also like to play with it? I am running more tests/numbers with 2.6.31 for all the IO controllers and planning to post it to lkml before we meet for IO mini summit. Numbers can help us understand the issue better. 
In the first phase I am planning to post numbers for the IO scheduler controller
and dm-ioband. Then I will get to the max bw controller of Andrea Righi.

> > Anyway, if you think that above patch is needed to solve the issue of
> > ioprio in higher level controller, why are you not posting it as part of
> > your patch series regularly, so that we can also apply this patch along
> > with other patches and test the effects?
>
> I will post the patch, but I would like to find out and understand the
> reason of above test results before posting the patch.
>

Ok. So in the mean time, I will continue to do testing with dm-ioband
version 1.14.0 and post the numbers.

> > Against what kernel version above patches apply. The biocgroup patches
> > I tried against 2.6.31 as well as 2.6.32-rc1 and it does not apply cleanly
> > against any of these?
> >
> > So for the time being I am doing testing with biocgroup patches.
>
> I created those patches against 2.6.32-rc1 and made sure the patches
> can be cleanly applied to that version.

I am applying the dm-ioband patch first and then the bio cgroup patches. Is
this the right order? Will try again.

Anyway, I don't have too much time before the IO mini summit, so I will stick
to 2.6.31 for the time being. If time permits, I will venture into 32-rc1 also.

Thanks
Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
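An illustrative sketch of the fio nit raised in the message above: take the
"max=" field from the "clat" row rather than "maxt" from the summary line.
The sample log lines, the file name sample_fio.log and the helper name
clat_max are assumptions made up for this example, and the exact fio output
layout differs between fio versions, so treat it as a starting point only.

```shell
# Made-up sample lines in the general shape of fio's text output (assumed).
cat > sample_fio.log <<'EOF'
read1: (groupid=0, jobs=1): err= 0: pid=1234
  read : io=292MB, bw=9992KB/s, iops=2498, runt= 30001msec
    clat (usec): min=52, max=413K, avg=398.21, stdev=2210.10
   READ: io=292MB, aggrb=9992KB/s, minb=9992KB/s, maxb=9992KB/s, mint=30001msec, maxt=30001msec
EOF

# Print "<max value> <unit>" for every completion-latency (clat) row.
# The summary line carrying maxt= is ignored because it has no "clat (".
clat_max() {
    awk '/clat \(/ {
        unit = $2
        gsub(/[():]/, "", unit)            # "(usec):" becomes "usec"
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^max=/) {
                v = $i
                sub(/^max=/, "", v)
                if (v == "") v = $(i + 1)  # value was space-padded after "max="
                sub(/,$/, "", v)
                print v, unit
            }
        }
    }' "$1"
}

clat_max sample_fio.log
```

Run against the per-job logs written by --output, this reports the worst
completion latency seen by each job, which is the number the latency columns
in the tables are meant to show.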
* Re: IO scheduler based IO controller V10 [not found] ` <20091007.233805.183040347.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> 2009-10-07 15:09 ` Vivek Goyal @ 2009-10-07 16:41 ` Rik van Riel 1 sibling, 0 replies; 349+ messages in thread From: Rik van Riel @ 2009-10-07 16:41 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Ryo Tsuruta wrote: > If once dm-ioband is integrated into the LVM tools and bandwidth can > be assigned per device by lvcreate, the use of dm-tools is no longer > required for users. A lot of large data center users have a SAN, with volume management handled SAN-side and dedicated LUNs for different applications or groups of applications. Because of alignment issues, they typically use filesystems directly on top of the LUNs, without partitions or LVM layers. We cannot rely on LVM for these systems, because people prefer not to use that. Besides ... isn't the goal of the cgroups io bandwidth controller to control the IO used by PROCESSES? If we want to control processes, why would we want the configuration to be applied to any other kind of object in the system? -- All rights reversed. ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091006112201.GA27866-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091006112201.GA27866-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-10-07 14:38 ` Ryo Tsuruta 0 siblings, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-10-07 14:38 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Hi Vivek, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > >> If one would like to > > > >> combine some physical disks into one logical device like a dm-linear, > > > >> I think one should map the IO controller on each physical device and > > > >> combine them into one logical device. > > > >> > > > > > > > > In fact this sounds like a more complicated step where one has to setup > > > > one dm-ioband device on top of each physical device. But I am assuming > > > > that this will go away once you move to per reuqest queue like implementation. > > > > I don't understand why the per request queue implementation makes it > > go away. If dm-ioband is integrated into the LVM tools, it could allow > > users to skip the complicated steps to configure dm-linear devices. > > > > Those who are not using dm-tools will be forced to use dm-tools for > bandwidth control features. If once dm-ioband is integrated into the LVM tools and bandwidth can be assigned per device by lvcreate, the use of dm-tools is no longer required for users. > Interesting. 
In all the test cases you always test with sequential > readers. I have changed the test case a bit (I have already reported the > results in another mail, now running the same test again with dm-version > 1.14). I made all the readers doing direct IO and in other group I put > a buffered writer. So setup looks as follows. > > In group1, I launch 1 prio 0 reader and increasing number of prio4 > readers. In group 2 I just run a dd doing buffered writes. Weights of > both the groups are 100 each. > > Following are the results on 2.6.31 kernel. > > With-dm-ioband > ============== > <------------prio4 readers----------------------> <---prio0 reader------> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > 1 9992KiB/s 9992KiB/s 9992KiB/s 413K usec 4621KiB/s 369K usec > 2 4859KiB/s 4265KiB/s 9122KiB/s 344K usec 4915KiB/s 401K usec > 4 2238KiB/s 1381KiB/s 7703KiB/s 532K usec 3195KiB/s 546K usec > 8 504KiB/s 46KiB/s 1439KiB/s 399K usec 7661KiB/s 220K usec > 16 131KiB/s 26KiB/s 638KiB/s 492K usec 4847KiB/s 359K usec > > With vanilla CFQ > ================ > <------------prio4 readers----------------------> <---prio0 reader------> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > 1 10779KiB/s 10779KiB/s 10779KiB/s 407K usec 16094KiB/s 808K usec > 2 7045KiB/s 6913KiB/s 13959KiB/s 538K usec 18794KiB/s 761K usec > 4 7842KiB/s 4409KiB/s 20967KiB/s 876K usec 12543KiB/s 443K usec > 8 6198KiB/s 2426KiB/s 24219KiB/s 1469K usec 9483KiB/s 685K usec > 16 5041KiB/s 1358KiB/s 27022KiB/s 2417K usec 6211KiB/s 1025K usec > > > Above results are showing how bandwidth got distributed between prio4 and > prio1 readers with-in group as we increased number of prio4 readers in > the group. In another group a buffered writer is continuously going on > as competitor. > > Notice, with dm-ioband how bandwidth allocation is broken. > > With 1 prio4 reader, prio4 reader got more bandwidth than prio1 reader. 
>
> With 2 prio4 readers, looks like prio4 got almost same BW as prio1.
>
> With 8 and 16 prio4 readers, looks like prio0 readers takes over and prio4
> readers starve.
>
> As we increase number of prio4 readers in the group, their total aggregate
> BW share should increase. Instead it is decreasing.
>
> So to me in the face of competition with a writer in other group, BW is
> all over the place. Some of these might be dm-ioband bugs and some of
> these might be coming from the fact that buffering takes place in higher
> layer and dispatch is FIFO?

Thank you for testing. I did the same test and here are the results.

with vanilla CFQ
    <-------------prio4 readers-------------------->  prio0        group2
    maxbw        minbw        aggrbw       maxlat     aggrbw       bufwrite
 1  12,140KiB/s  12,140KiB/s  12,140KiB/s  30001msec  11,125KiB/s   1,923KiB/s
 2   3,967KiB/s   3,930KiB/s   7,897KiB/s  30001msec  14,213KiB/s   1,586KiB/s
 4   3,399KiB/s   3,066KiB/s  13,031KiB/s  30082msec   8,930KiB/s   1,296KiB/s
 8   2,086KiB/s   1,720KiB/s  15,266KiB/s  30003msec   7,546KiB/s     517KiB/s
16   1,156KiB/s     837KiB/s  15,377KiB/s  30033msec   4,282KiB/s     600KiB/s

with dm-ioband weight-iosize policy
    <-------------prio4 readers-------------------->  prio0        group2
    maxbw        minbw        aggrbw       maxlat     aggrbw       bufwrite
 1     107KiB/s     107KiB/s     107KiB/s  30007msec  12,242KiB/s  12,320KiB/s
 2   1,259KiB/s     702KiB/s   1,961KiB/s  30037msec   9,657KiB/s  11,657KiB/s
 4   2,705KiB/s      29KiB/s   5,186KiB/s  30026msec   5,927KiB/s  11,300KiB/s
 8   2,428KiB/s      27KiB/s   5,629KiB/s  30054msec   5,057KiB/s  10,704KiB/s
16   2,465KiB/s      23KiB/s   4,309KiB/s  30032msec   4,750KiB/s   9,088KiB/s

The results are somewhat different from yours. The bandwidth is
distributed to each group equally, but CFQ priority is broken as you
said. I think that the reason is not FIFO dispatch, but that some IO
requests are issued from dm-ioband's kernel thread on behalf of the
processes which originate them, so CFQ assumes that the kernel thread
is the originator and uses its io_context.

> > Here is my test script.
> > ------------------------------------------------------------------------- > > arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \ > > --group_reporting" > > > > sync > > echo 3 > /proc/sys/vm/drop_caches > > > > echo $$ > /cgroup/1/tasks > > ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 & > > echo $$ > /cgroup/2/tasks > > ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 & > > ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 & > > echo $$ > /cgroup/tasks > > wait > > ------------------------------------------------------------------------- > > > > Be that as it way, I think that if every bio can point the iocontext > > of the process, then it makes it possible to handle IO priority in the > > higher level controller. A patchse has already posted by Takhashi-san. > > What do you think about this idea? > > > > Date Tue, 22 Apr 2008 22:51:31 +0900 (JST) > > Subject [RFC][PATCH 1/10] I/O context inheritance > > From Hirokazu Takahashi <> > > http://lkml.org/lkml/2008/4/22/195 > > So far you have been denying that there are issues with ioprio with-in > group in higher level controller. Here you seems to be saying that there are > issues with ioprio and we need to take this patch in to solve the issue? I am > confused? The true intention of this patch is to preserve the io-context of a process which originate it, but I think that we could also make use of this patch for one of the way to solve this issue. > Anyway, if you think that above patch is needed to solve the issue of > ioprio in higher level controller, why are you not posting it as part of > your patch series regularly, so that we can also apply this patch along > with other patches and test the effects? I will post the patch, but I would like to find out and understand the reason of above test results before posting the patch. > Against what kernel version above patches apply. 
The biocgroup patches
> I tried against 2.6.31 as well as 2.6.32-rc1 and it does not apply cleanly
> against any of these?
>
> So for the time being I am doing testing with biocgroup patches.

I created those patches against 2.6.32-rc1 and made sure the patches
can be cleanly applied to that version.

Thanks,
Ryo Tsuruta

^ permalink raw reply [flat|nested] 349+ messages in thread
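A worked check of the claim in the message above that dm-ioband distributes
bandwidth equally between the two groups. The input rows are the aggrbw
(prio4), aggrbw (prio0) and bufwrite columns of the dm-ioband weight-iosize
table, in KiB/s; the helper name group_shares is made up for this sketch.

```shell
# Input columns: nr  prio4_aggrbw  prio0_aggrbw  group2_bufwrite  (KiB/s)
# Output columns: nr  group1_total  group2_total  group1_share_percent
group_shares() {
    awk '{
        g1 = $2 + $3                       # group1 = prio4 readers + prio0 reader
        printf "%d %d %d %.1f\n", $1, g1, $4, 100 * g1 / (g1 + $4)
    }'
}

group_shares <<'EOF'
1 107 12242 12320
2 1961 9657 11657
4 5186 5927 11300
8 5629 5057 10704
16 4309 4750 9088
EOF
```

Every row comes out close to a 50% share for group1, which matches the equal
group weights, even though the split between the prio4 and prio0 readers
inside group1 varies widely.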
[parent not found: <20091006.161744.189719641.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091006.161744.189719641.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> @ 2009-10-06 11:22 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-06 11:22 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Tue, Oct 06, 2009 at 04:17:44PM +0900, Ryo Tsuruta wrote: > Hi Vivek and Nauman, > > Nauman Rafique <nauman-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote: > > >> > > How about adding a callback function to the higher level controller? > > >> > > CFQ calls it when the active queue runs out of time, then the higer > > >> > > level controller use it as a trigger or a hint to move IO group, so > > >> > > I think a time-based controller could be implemented at higher level. > > >> > > > > >> > > > >> > Adding a call back should not be a big issue. But that means you are > > >> > planning to run only one group at higher layer at one time and I think > > >> > that's the problem because than we are introducing serialization at higher > > >> > layer. So any higher level device mapper target which has multiple > > >> > physical disks under it, we might be underutilizing these even more and > > >> > take a big hit on overall throughput. > > >> > > > >> > The whole design of doing proportional weight at lower layer is optimial > > >> > usage of system. 
> > >>
> > >> But I think that the higher level approach makes easy to configure
> > >> against striped software raid devices.
> > >
> > > How does it make easier to configure in case of higher level controller?
> > >
> > > In case of lower level design, one just have to create cgroups and assign
> > > weights to cgroups. This minimum step will be required in higher level
> > > controller also. (Even if you get rid of dm-ioband device setup step).
>
> In the case of lower level controller, if we need to assign weights on
> a per device basis, we have to assign weights to all devices of which
> a raid device consists, but in the case of higher level controller,
> we just assign weights to the raid device only.
>

This is required only if you need to assign different weights to different
devices. It is just an additional facility and not a requirement. Normally
you will not need to do that, and devices will inherit the cgroup weights
automatically. So one only has to assign the cgroup weights.

> > >> If one would like to
> > >> combine some physical disks into one logical device like a dm-linear,
> > >> I think one should map the IO controller on each physical device and
> > >> combine them into one logical device.
> > >>
> > >
> > > In fact this sounds like a more complicated step where one has to setup
> > > one dm-ioband device on top of each physical device. But I am assuming
> > > that this will go away once you move to per request queue like implementation.
>
> I don't understand why the per request queue implementation makes it
> go away. If dm-ioband is integrated into the LVM tools, it could allow
> users to skip the complicated steps to configure dm-linear devices.
>

Those who are not using dm-tools will be forced to use dm-tools for
bandwidth control features.

> > > I think it should be same in principle as my initial implementation of IO
> > > controller on request queue and I stopped development on it because of FIFO
> > > dispatch.
>
> I think that FIFO dispatch seldom lead to priority inversion, because
> holding period for throttling is not too long to break the IO priority.
> I did some tests to see whether priority inversion is happened.
>
> The first test ran fio sequential readers on the same group. The BE0
> reader got the highest throughput as I expected.
>
> nr_threads           16 |         16 |            1
> ionice              BE7 |        BE7 |          BE0
> ------------------------+------------+-------------
> vanilla     10,076KiB/s | 9,779KiB/s | 32,775KiB/s
> ioband       9,576KiB/s | 9,367KiB/s | 34,154KiB/s
>
> The second test ran fio sequential readers on two different groups and
> give weights of 20 and 10 to each group respectively. The bandwidth
> was distributed according to their weights and the BE0 reader got
> higher throughput than the BE7 readers in the same group. IO priority
> was preserved within the IO group.
>
> group            group1 | group2
> weight               20 | 10
> ------------------------+--------------------------
> nr_threads           16 |         16 |            1
> ionice              BE7 |        BE7 |          BE0
> ------------------------+--------------------------
> ioband      27,513KiB/s | 3,524KiB/s | 10,248KiB/s
>                         | Total = 13,772KiB/s
>

Interesting. In all the test cases you always test with sequential
readers. I have changed the test case a bit (I have already reported the
results in another mail, and am now running the same test again with
dm-ioband version 1.14). I made all the readers do direct IO, and in the
other group I put a buffered writer. So the setup looks as follows.

In group1, I launch 1 prio 0 reader and an increasing number of prio4
readers. In group 2 I just run a dd doing buffered writes. Weights of
both the groups are 100 each.

Following are the results on 2.6.31 kernel.
With-dm-ioband
==============
    <------------prio4 readers----------------------->  <---prio0 reader------->
nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
 1    9992KiB/s    9992KiB/s    9992KiB/s    413K usec    4621KiB/s    369K usec
 2    4859KiB/s    4265KiB/s    9122KiB/s    344K usec    4915KiB/s    401K usec
 4    2238KiB/s    1381KiB/s    7703KiB/s    532K usec    3195KiB/s    546K usec
 8     504KiB/s      46KiB/s    1439KiB/s    399K usec    7661KiB/s    220K usec
16     131KiB/s      26KiB/s     638KiB/s    492K usec    4847KiB/s    359K usec

With vanilla CFQ
================
    <------------prio4 readers----------------------->  <---prio0 reader------->
nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
 1   10779KiB/s   10779KiB/s   10779KiB/s    407K usec   16094KiB/s    808K usec
 2    7045KiB/s    6913KiB/s   13959KiB/s    538K usec   18794KiB/s    761K usec
 4    7842KiB/s    4409KiB/s   20967KiB/s    876K usec   12543KiB/s    443K usec
 8    6198KiB/s    2426KiB/s   24219KiB/s   1469K usec    9483KiB/s    685K usec
16    5041KiB/s    1358KiB/s   27022KiB/s   2417K usec    6211KiB/s   1025K usec

The above results show how bandwidth got distributed between the prio4 and
prio0 readers within the group as we increased the number of prio4 readers
in the group. In the other group a buffered writer is running continuously
as a competitor.

Notice how bandwidth allocation is broken with dm-ioband.

With 1 prio4 reader, the prio4 reader got more bandwidth than the prio0 reader.

With 2 prio4 readers, it looks like prio4 got almost the same BW as prio0.

With 8 and 16 prio4 readers, it looks like the prio0 reader takes over and
the prio4 readers starve.

As we increase the number of prio4 readers in the group, their total
aggregate BW share should increase. Instead it is decreasing.

So to me, in the face of competition with a writer in the other group, BW is
all over the place. Some of this might be dm-ioband bugs, and some of it
might come from the fact that buffering takes place in the higher layer and
dispatch is FIFO?

> Here is my test script.
> -------------------------------------------------------------------------
> arg="--time_base --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
>      --group_reporting"
>
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> echo $$ > /cgroup/1/tasks
> ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> echo $$ > /cgroup/2/tasks
> ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> echo $$ > /cgroup/tasks
> wait
> -------------------------------------------------------------------------
>
> Be that as it may, I think that if every bio can point the iocontext
> of the process, then it makes it possible to handle IO priority in the
> higher level controller. A patchset has already been posted by Takahashi-san.
> What do you think about this idea?
>
> Date Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> Subject [RFC][PATCH 1/10] I/O context inheritance
> From Hirokazu Takahashi <>
> http://lkml.org/lkml/2008/4/22/195

So far you have been denying that there are issues with ioprio within a
group in the higher level controller. Here you seem to be saying that there
are issues with ioprio and that we need to take this patch to solve them?
I am confused.

Anyway, if you think that the above patch is needed to solve the issue of
ioprio in the higher level controller, why are you not posting it regularly
as part of your patch series, so that we can also apply it along with the
other patches and test the effects?

>
> > > So you seem to be suggesting that you will move dm-ioband to request queue
> > > so that setting up additional device setup is gone. You will also enable
> > > it to do time based groups policy, so that we don't run into issues on
> > > seeky media. Will also enable dispatch from one group only at a time so
> > > that we don't run into isolation issues and can do time accounting
> > > accurately.
> >
> > Will that approach solve the problem of doing bandwidth control on
> > logical devices? What would be the advantages compared to Vivek's
> > current patches?
>
> I will only move the point where dm-ioband grabs bios, other
> dm-ioband's mechanism and functionality will still be the same.
> The advantages against to scheduler based controllers are:
>  - can work with any type of block devices
>  - can work with any type of IO scheduler and no need a big change.
>

Whether a big change is needed we will only know for sure once the
implementation of the timed groups is done and shown to work as well as my
patches. There are so many subtle things with the time based approach.

[..]

> > >> > Is there a new version of dm-ioband now where you have solved the issue of
> > >> > sync/async dispatch with-in group? Before meeting at mini-summit, I am
> > >> > trying to run some tests and come up with numbers so that we have more
> > >> > clear picture of pros/cons.
> > >>
> > >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> > >> dm-ioband handles sync/async IO requests separately and
> > >> the write-starve-read issue you pointed out is fixed. I would
> > >> appreciate it if you would try them.
> > >> http://sourceforge.net/projects/ioband/files/
> > >
> > > Cool. Will get to testing it.
>
> Thanks for your help in advance.

Against what kernel version do the above patches apply? I tried the
biocgroup patches against 2.6.31 as well as 2.6.32-rc1, and they do not
apply cleanly against either of these.

So for the time being I am doing testing with biocgroup patches.

Thanks
Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
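A small worked check of the observation in the message above that the prio4
aggregate bandwidth falls under dm-ioband as readers are added, while it
rises under vanilla CFQ. The inputs are the prio4 Agg-bdwidth column of the
two tables, in KiB/s; the helper name bw_trend is made up for this sketch.

```shell
# Input: one "nr aggregate_bandwidth" pair per line, ordered by increasing nr.
# Output: "increasing", "decreasing" or "mixed".
bw_trend() {
    awk 'BEGIN { up = 0; down = 0 }
         NR > 1 { if ($2 > prev) up++; else down++ }
         { prev = $2 }
         END {
             if (down == 0) print "increasing"
             else if (up == 0) print "decreasing"
             else print "mixed"
         }'
}

echo "dm-ioband:   $(printf '1 9992\n2 9122\n4 7703\n8 1439\n16 638\n' | bw_trend)"
echo "vanilla CFQ: $(printf '1 10779\n2 13959\n4 20967\n8 24219\n16 27022\n' | bw_trend)"
```

Feeding other columns through the same helper (for example the prio0
Agg-bdwidth values) is an easy way to compare runs without eyeballing the
tables.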
[parent not found: <20091005171023.GG22143-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091005171023.GG22143-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-10-05 18:11 ` Nauman Rafique 0 siblings, 0 replies; 349+ messages in thread From: Nauman Rafique @ 2009-10-05 18:11 UTC (permalink / raw) To: Vivek Goyal Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Mon, Oct 5, 2009 at 10:10 AM, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote: >> Hi Vivek, >> >> Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote: >> > > Hi, >> > > >> > > Munehiro Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org> wrote: >> > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM: >> > > > > Before finishing this mail, will throw a whacky idea in the ring. I was >> > > > > going through the request based dm-multipath paper. Will it make sense >> > > > > to implement request based dm-ioband? So basically we implement all the >> > > > > group scheduling in CFQ and let dm-ioband implement a request function >> > > > > to take the request and break it back into bios. This way we can keep >> > > > > all the group control at one place and also meet most of the requirements. 
>> > > > > >> > > > > So request based dm-ioband will have a request in hand once that request >> > > > > has passed group control and prio control. Because dm-ioband is a device >> > > > > mapper target, one can put it on higher level devices (practically taking >> > > > > CFQ at higher level device), and provide fairness there. One can also >> > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing >> > > > > them to use the IO scheduler.) >> > > > > >> > > > > I am sure that will be many issues but one big issue I could think of that >> > > > > CFQ thinks that there is one device beneath it and dipsatches requests >> > > > > from one queue (in case of idling) and that would kill parallelism at >> > > > > higher layer and throughput will suffer on many of the dm/md configurations. >> > > > > >> > > > > Thanks >> > > > > Vivek >> > > > >> > > > As long as using CFQ, your idea is reasonable for me. But how about for >> > > > other IO schedulers? In my understanding, one of the keys to guarantee >> > > > group isolation in your patch is to have per-group IO scheduler internal >> > > > queue even with as, deadline, and noop scheduler. I think this is >> > > > great idea, and to implement generic code for all IO schedulers was >> > > > concluded when we had so many IO scheduler specific proposals. >> > > > If we will still need per-group IO scheduler internal queues with >> > > > request-based dm-ioband, we have to modify elevator layer. It seems >> > > > out of scope of dm. >> > > > I might miss something... >> > > >> > > IIUC, the request based device-mapper could not break back a request >> > > into bio, so it could not work with block devices which don't use the >> > > IO scheduler. >> > > >> > >> > I think current request based multipath drvier does not do it but can't it >> > be implemented that requests are broken back into bio? 
>> >> I guess it would be hard to implement it, and we need to hold requests >> and throttle them at there and it would break the ordering by CFQ. >> >> > Anyway, I don't feel too strongly about this approach as it might >> > introduce more serialization at higher layer. >> >> Yes, I know it. >> >> > > How about adding a callback function to the higher level controller? >> > > CFQ calls it when the active queue runs out of time, then the higer >> > > level controller use it as a trigger or a hint to move IO group, so >> > > I think a time-based controller could be implemented at higher level. >> > > >> > >> > Adding a call back should not be a big issue. But that means you are >> > planning to run only one group at higher layer at one time and I think >> > that's the problem because than we are introducing serialization at higher >> > layer. So any higher level device mapper target which has multiple >> > physical disks under it, we might be underutilizing these even more and >> > take a big hit on overall throughput. >> > >> > The whole design of doing proportional weight at lower layer is optimial >> > usage of system. >> >> But I think that the higher level approch makes easy to configure >> against striped software raid devices. > > How does it make easier to configure in case of higher level controller? > > In case of lower level design, one just have to create cgroups and assign > weights to cgroups. This mininum step will be required in higher level > controller also. (Even if you get rid of dm-ioband device setup step). > >> If one would like to >> combine some physical disks into one logical device like a dm-linear, >> I think one should map the IO controller on each physical device and >> combine them into one logical device. >> > > In fact this sounds like a more complicated step where one has to setup > one dm-ioband device on top of each physical device. But I am assuming > that this will go away once you move to per reuqest queue like implementation. 
> > I think it should be same in principal as my initial implementation of IO > controller on request queue and I stopped development on it because of FIFO > dispatch. > > So you seem to be suggesting that you will move dm-ioband to request queue > so that setting up additional device setup is gone. You will also enable > it to do time based groups policy, so that we don't run into issues on > seeky media. Will also enable dispatch from one group only at a time so > that we don't run into isolation issues and can do time accounting > accruately. Will that approach solve the problem of doing bandwidth control on logical devices? What would be the advantages compared to Vivek's current patches? > > If yes, then that has the potential to solve the issue. At higher layer one > can think of enabling size of IO/number of IO policy both for proportional > BW and max BW type of control. At lower level one can enable pure time > based control on seeky media. > > I think this will still left with the issue of prio with-in group as group > control is separate and you will not be maintatinig separate queues for > each process. Similarly you will also have isseus with read vs write > ratios as IO schedulers underneath change. > > So I will be curious to see that implementation. > >> > > My requirements for IO controller are: >> > > - Implement s a higher level controller, which is located at block >> > > layer and bio is grabbed in generic_make_request(). >> > >> > How are you planning to handle the issue of buffered writes Andrew raised? >> >> I think that it would be better to use the higher-level controller >> along with the memory controller and have limits memory usage for each >> cgroup. And as Kamezawa-san said, having limits of dirty pages would >> be better, too. >> > > Ok. So if we plan to co-mount memory controller with per memory group > dirty_ratio implemented, that can work with both higher level as well as > low level controller. 
Not sure if we also require some kind of a per > memory group flusher thread infrastructure also to make sure higher weight > group gets more job done. > >> > > - Can work with any type of IO scheduler. >> > > - Can work with any type of block devices. >> > > - Support multiple policies, proportional wegiht, max rate, time >> > > based, ans so on. >> > > >> > > The IO controller mini-summit will be held in next week, and I'm >> > > looking forard to meet you all and discuss about IO controller. >> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit >> > >> > Is there a new version of dm-ioband now where you have solved the issue of >> > sync/async dispatch with-in group? Before meeting at mini-summit, I am >> > trying to run some tests and come up with numbers so that we have more >> > clear picture of pros/cons. >> >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new >> dm-ioband handles sync/async IO requests separately and >> the write-starve-read issue you pointed out is fixed. I would >> appreciate it if you would try them. >> http://sourceforge.net/projects/ioband/files/ > > Cool. Will get to testing it. > > Thanks > Vivek > ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091005.235535.193690928.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091005.235535.193690928.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> @ 2009-10-05 17:10 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-05 17:10 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote: > > > Hi, > > > > > > Munehiro Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org> wrote: > > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM: > > > > > Before finishing this mail, will throw a whacky idea in the ring. I was > > > > > going through the request based dm-multipath paper. Will it make sense > > > > > to implement request based dm-ioband? So basically we implement all the > > > > > group scheduling in CFQ and let dm-ioband implement a request function > > > > > to take the request and break it back into bios. This way we can keep > > > > > all the group control at one place and also meet most of the requirements. > > > > > > > > > > So request based dm-ioband will have a request in hand once that request > > > > > has passed group control and prio control. 
Because dm-ioband is a device > > > > > mapper target, one can put it on higher level devices (practically taking > > > > > CFQ at higher level device), and provide fairness there. One can also > > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing > > > > > them to use the IO scheduler.) > > > > > > > > > > I am sure that will be many issues but one big issue I could think of that > > > > > CFQ thinks that there is one device beneath it and dipsatches requests > > > > > from one queue (in case of idling) and that would kill parallelism at > > > > > higher layer and throughput will suffer on many of the dm/md configurations. > > > > > > > > > > Thanks > > > > > Vivek > > > > > > > > As long as using CFQ, your idea is reasonable for me. But how about for > > > > other IO schedulers? In my understanding, one of the keys to guarantee > > > > group isolation in your patch is to have per-group IO scheduler internal > > > > queue even with as, deadline, and noop scheduler. I think this is > > > > great idea, and to implement generic code for all IO schedulers was > > > > concluded when we had so many IO scheduler specific proposals. > > > > If we will still need per-group IO scheduler internal queues with > > > > request-based dm-ioband, we have to modify elevator layer. It seems > > > > out of scope of dm. > > > > I might miss something... > > > > > > IIUC, the request based device-mapper could not break back a request > > > into bio, so it could not work with block devices which don't use the > > > IO scheduler. > > > > > > > I think current request based multipath drvier does not do it but can't it > > be implemented that requests are broken back into bio? > > I guess it would be hard to implement it, and we need to hold requests > and throttle them at there and it would break the ordering by CFQ. > > > Anyway, I don't feel too strongly about this approach as it might > > introduce more serialization at higher layer. > > Yes, I know it. 
>
> > > How about adding a callback function to the higher level controller?
> > > CFQ calls it when the active queue runs out of time, then the higher
> > > level controller use it as a trigger or a hint to move IO group, so
> > > I think a time-based controller could be implemented at higher level.
> > >
> >
> > Adding a call back should not be a big issue. But that means you are
> > planning to run only one group at higher layer at one time and I think
> > that's the problem because then we are introducing serialization at higher
> > layer. So any higher level device mapper target which has multiple
> > physical disks under it, we might be underutilizing these even more and
> > take a big hit on overall throughput.
> >
> > The whole design of doing proportional weight at lower layer is optimal
> > usage of system.
>
> But I think that the higher level approach makes easy to configure
> against striped software raid devices.

How does it make things easier to configure in the case of a higher
level controller?

In the case of the lower level design, one just has to create cgroups and
assign weights to them. This minimum step will be required with a higher
level controller also (even if you get rid of the dm-ioband device setup
step).

> If one would like to
> combine some physical disks into one logical device like a dm-linear,
> I think one should map the IO controller on each physical device and
> combine them into one logical device.
>

In fact this sounds like a more complicated step, where one has to set up
one dm-ioband device on top of each physical device. But I am assuming
that this will go away once you move to a per request queue like
implementation.

I think it should be the same in principle as my initial implementation
of the IO controller on the request queue, and I stopped development on
it because of FIFO dispatch.

So you seem to be suggesting that you will move dm-ioband to the request
queue so that the additional device setup step is gone. You will also
enable it to do a time based groups policy, so that we don't run into
issues on seeky media. You will also enable dispatch from one group only
at a time, so that we don't run into isolation issues and can do time
accounting accurately.

If yes, then that has the potential to solve the issue. At higher layer
one can think of enabling size of IO/number of IO policy both for
proportional BW and max BW type of control. At lower level one can enable
pure time based control on seeky media.

I think this will still be left with the issue of prio within a group, as
group control is separate and you will not be maintaining separate queues
for each process. Similarly you will also have issues with read vs write
ratios as the IO schedulers underneath change.

So I will be curious to see that implementation.

> > > My requirements for IO controller are:
> > > - Implement as a higher level controller, which is located at block
> > >   layer and bio is grabbed in generic_make_request().
> >
> > How are you planning to handle the issue of buffered writes Andrew raised?
>
> I think that it would be better to use the higher-level controller
> along with the memory controller and have limits memory usage for each
> cgroup. And as Kamezawa-san said, having limits of dirty pages would
> be better, too.
>

Ok. So if we plan to co-mount the memory controller with a per memory
group dirty_ratio implemented, that can work with both the higher level
and the low level controller. Not sure if we also require some kind of
per memory group flusher thread infrastructure to make sure a higher
weight group gets more work done.

> > > - Can work with any type of IO scheduler.
> > > - Can work with any type of block devices.
> > > - Support multiple policies, proportional weight, max rate, time
> > >   based, and so on.
> > >
> > > The IO controller mini-summit will be held in next week, and I'm
> > > looking forward to meet you all and discuss about IO controller.
> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> >
> > Is there a new version of dm-ioband now where you have solved the issue of
> > sync/async dispatch with-in group? Before meeting at mini-summit, I am
> > trying to run some tests and come up with numbers so that we have more
> > clear picture of pros/cons.
>
> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> dm-ioband handles sync/async IO requests separately and
> the write-starve-read issue you pointed out is fixed. I would
> appreciate it if you would try them.
> http://sourceforge.net/projects/ioband/files/

Cool. Will get to testing it.

Thanks
Vivek

^ permalink raw reply [flat|nested] 349+ messages in thread
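As a rough illustration of the trade-off described in the message above (a size-of-IO policy at the higher layer versus a pure time based policy at the lower layer), the sketch below compares the bandwidth two equal-weight groups would get under each policy. It is an idealized model and not code from any of the patch sets: the function names, the group names, and the 100 MB/s and 4 MB/s solo rates are made-up assumptions, and the cost of seeking between groups is ignored.

```python
def time_based_bandwidth(weights, solo_rate):
    """Each group gets disk time in proportion to its weight.

    solo_rate[g] is the (assumed) bandwidth group g achieves when it has
    the disk to itself, so its bandwidth is time share * solo rate.
    """
    total_weight = sum(weights.values())
    return {g: weights[g] / total_weight * solo_rate[g] for g in weights}


def size_based_bandwidth(weights, solo_rate):
    """Each group gets bytes in proportion to its weight.

    If group g receives bandwidth k * weights[g], it occupies the disk for
    a fraction k * weights[g] / solo_rate[g] of the time. The fractions
    must add up to 1, which fixes k.
    """
    k = 1.0 / sum(weights[g] / solo_rate[g] for g in weights)
    return {g: k * weights[g] for g in weights}


# Hypothetical workload: one sequential reader and one seeky reader,
# both with the same weight. Rates are in MB/s and are illustrative only.
weights = {"sequential": 1, "seeky": 1}
solo_rate = {"sequential": 100.0, "seeky": 4.0}

by_time = time_based_bandwidth(weights, solo_rate)
by_size = size_based_bandwidth(weights, solo_rate)

print("time based:", by_time, "total", sum(by_time.values()))
print("size based:", by_size, "total", sum(by_size.values()))
```

Under these assumptions the time based policy leaves each group with half of its own solo rate, while the size based policy gives both groups the same bandwidth at a much lower aggregate throughput, because the seeky group ends up holding the disk most of the time.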
[parent not found: <20091005123148.GB22143-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091005123148.GB22143-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-10-05 14:55 ` Ryo Tsuruta 0 siblings, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-10-05 14:55 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Hi Vivek, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote: > > Hi, > > > > Munehiro Ikeda <m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org> wrote: > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM: > > > > Before finishing this mail, will throw a whacky idea in the ring. I was > > > > going through the request based dm-multipath paper. Will it make sense > > > > to implement request based dm-ioband? So basically we implement all the > > > > group scheduling in CFQ and let dm-ioband implement a request function > > > > to take the request and break it back into bios. This way we can keep > > > > all the group control at one place and also meet most of the requirements. > > > > > > > > So request based dm-ioband will have a request in hand once that request > > > > has passed group control and prio control. Because dm-ioband is a device > > > > mapper target, one can put it on higher level devices (practically taking > > > > CFQ at higher level device), and provide fairness there. 
One can also > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing > > > > them to use the IO scheduler.) > > > > > > > > I am sure that will be many issues but one big issue I could think of that > > > > CFQ thinks that there is one device beneath it and dipsatches requests > > > > from one queue (in case of idling) and that would kill parallelism at > > > > higher layer and throughput will suffer on many of the dm/md configurations. > > > > > > > > Thanks > > > > Vivek > > > > > > As long as using CFQ, your idea is reasonable for me. But how about for > > > other IO schedulers? In my understanding, one of the keys to guarantee > > > group isolation in your patch is to have per-group IO scheduler internal > > > queue even with as, deadline, and noop scheduler. I think this is > > > great idea, and to implement generic code for all IO schedulers was > > > concluded when we had so many IO scheduler specific proposals. > > > If we will still need per-group IO scheduler internal queues with > > > request-based dm-ioband, we have to modify elevator layer. It seems > > > out of scope of dm. > > > I might miss something... > > > > IIUC, the request based device-mapper could not break back a request > > into bio, so it could not work with block devices which don't use the > > IO scheduler. > > > > I think current request based multipath drvier does not do it but can't it > be implemented that requests are broken back into bio? I guess it would be hard to implement it, and we need to hold requests and throttle them at there and it would break the ordering by CFQ. > Anyway, I don't feel too strongly about this approach as it might > introduce more serialization at higher layer. Yes, I know it. > > How about adding a callback function to the higher level controller? 
> > CFQ calls it when the active queue runs out of time, then the higher
> > level controller use it as a trigger or a hint to move IO group, so
> > I think a time-based controller could be implemented at higher level.
> >
>
> Adding a call back should not be a big issue. But that means you are
> planning to run only one group at higher layer at one time and I think
> that's the problem because then we are introducing serialization at higher
> layer. So any higher level device mapper target which has multiple
> physical disks under it, we might be underutilizing these even more and
> take a big hit on overall throughput.
>
> The whole design of doing proportional weight at lower layer is optimal
> usage of system.

But I think that the higher level approach makes it easy to configure
against striped software raid devices. If one would like to combine
some physical disks into one logical device like a dm-linear, I think
one should map the IO controller on each physical device and combine
them into one logical device.

> > My requirements for IO controller are:
> > - Implement as a higher level controller, which is located at block
> >   layer and bio is grabbed in generic_make_request().
>
> How are you planning to handle the issue of buffered writes Andrew raised?

I think that it would be better to use the higher-level controller
along with the memory controller and to limit memory usage for each
cgroup. And as Kamezawa-san said, having limits on dirty pages would
be better, too.

> > - Can work with any type of IO scheduler.
> > - Can work with any type of block devices.
> > - Support multiple policies, proportional weight, max rate, time
> >   based, and so on.
> >
> > The IO controller mini-summit will be held in next week, and I'm
> > looking forward to meet you all and discuss about IO controller.
> > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
>
> Is there a new version of dm-ioband now where you have solved the issue of
> sync/async dispatch with-in group? Before meeting at mini-summit, I am
> trying to run some tests and come up with numbers so that we have more
> clear picture of pros/cons.

Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
dm-ioband handles sync/async IO requests separately and
the write-starve-read issue you pointed out is fixed. I would
appreciate it if you would try them.
http://sourceforge.net/projects/ioband/files/

Thanks,
Ryo Tsuruta

^ permalink raw reply [flat|nested] 349+ messages in thread
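The callback idea discussed in the message above (the IO scheduler tells the higher level controller when the active queue has run out of time, and the controller uses that as a trigger to move to the next IO group) can be sketched as a toy model. This is only an illustration under assumptions: `ToyScheduler`, `HigherLevelController`, `on_slice_expired` and the 100 ms slice are hypothetical names and values, not interfaces from CFQ or dm-ioband.

```python
class HigherLevelController:
    """Hypothetical controller that lets one IO group dispatch at a time."""

    def __init__(self, groups):
        self.groups = list(groups)
        self.active = 0
        # Record of which group was active, in order, for inspection.
        self.history = [self.groups[0]]

    def on_slice_expired(self):
        """Callback: the scheduler reports that the active slice is used up."""
        self.active = (self.active + 1) % len(self.groups)
        self.history.append(self.groups[self.active])


class ToyScheduler:
    """Stand-in for the IO scheduler; only tracks slice usage."""

    def __init__(self, slice_ms, on_expire):
        self.slice_ms = slice_ms
        self.on_expire = on_expire
        self.used_ms = 0

    def account(self, service_ms):
        """Charge service time and fire the callback once per expired slice."""
        self.used_ms += service_ms
        while self.used_ms >= self.slice_ms:
            self.used_ms -= self.slice_ms
            self.on_expire()


controller = HigherLevelController(["group A", "group B"])
scheduler = ToyScheduler(slice_ms=100, on_expire=controller.on_slice_expired)

scheduler.account(60)   # slice not yet used up, group A stays active
scheduler.account(60)   # 120 ms charged in total: one expiry, switch to B
scheduler.account(180)  # two more expiries: back to A, then B again
print(controller.history)
```

The model also makes the objection raised in the thread visible: only one group is ever active at the higher layer, so with several physical disks underneath, the disks not used by the active group would sit idle.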
[parent not found: <20091002025731.GA2738-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091002025731.GA2738-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-10-02 20:27 ` Munehiro Ikeda 0 siblings, 0 replies; 349+ messages in thread From: Munehiro Ikeda @ 2009-10-02 20:27 UTC (permalink / raw) To: Vivek Goyal Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Vivek Goyal wrote, on 10/01/2009 10:57 PM: > Before finishing this mail, will throw a whacky idea in the ring. I was > going through the request based dm-multipath paper. Will it make sense > to implement request based dm-ioband? So basically we implement all the > group scheduling in CFQ and let dm-ioband implement a request function > to take the request and break it back into bios. This way we can keep > all the group control at one place and also meet most of the requirements. > > So request based dm-ioband will have a request in hand once that request > has passed group control and prio control. Because dm-ioband is a device > mapper target, one can put it on higher level devices (practically taking > CFQ at higher level device), and provide fairness there. One can also > put it on those SSDs which don't use IO scheduler (this is kind of forcing > them to use the IO scheduler.) 
>
> I am sure that will be many issues but one big issue I could think of that
> CFQ thinks that there is one device beneath it and dispatches requests
> from one queue (in case of idling) and that would kill parallelism at
> higher layer and throughput will suffer on many of the dm/md configurations.
>
> Thanks
> Vivek

As long as CFQ is used, your idea seems reasonable to me. But what about
other IO schedulers? In my understanding, one of the keys to guaranteeing
group isolation in your patch is to have per-group IO scheduler internal
queues even with the as, deadline, and noop schedulers. I think this is a
great idea, and implementing generic code for all IO schedulers was what
we concluded when we had so many IO scheduler specific proposals.
If we still need per-group IO scheduler internal queues with
request-based dm-ioband, we have to modify the elevator layer. That seems
out of scope for dm.
I might be missing something...

--
IKEDA, Munehiro
NEC Corporation of America
m-ikeda-MDRzhb/z0dd8UrSeD/g0lQ@public.gmane.org

^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <20091001133109.GA4058-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <20091001133109.GA4058-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-10-02 2:57 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-10-02 2:57 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Thu, Oct 01, 2009 at 09:31:09AM -0400, Vivek Goyal wrote: > On Thu, Oct 01, 2009 at 03:41:25PM +0900, Ryo Tsuruta wrote: > > Hi Vivek, > > > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote: > > > > Hi Vivek, > > > > > > > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > > > I was thinking that elevator layer will do the merge of bios. So IO > > > > > scheduler/elevator can time stamp the first bio in the request as it goes > > > > > into the disk and again timestamp with finish time once request finishes. > > > > > > > > > > This way higher layer can get an idea how much disk time a group of bios > > > > > used. But on multi queue, if we dispatch say 4 requests from same queue, > > > > > then time accounting becomes an issue. > > > > > > > > > > Consider following where four requests rq1, rq2, rq3 and rq4 are > > > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these > > > > > requests finish at time t4, t5, t6 and t7. 
For sake of simplicity assume
> > > > > time elapsed between each of milestones is t. Also assume that all these
> > > > > requests are from same queue/group.
> > > > >
> > > > >   t0   t1   t2   t3   t4   t5   t6   t7
> > > > >   rq1  rq2  rq3  rq4  rq1  rq2  rq3  rq4
> > > > >
> > > > > Now higher layer will think that time consumed by group is:
> > > > >
> > > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t
> > > > >
> > > > > But the time elapsed is only 7t.
> > > >
> > > > IO controller can know how many requests are issued and still in
> > > > progress. Is it not enough to accumulate the time while in-flight IOs
> > > > exist?
> > > >
> > >
> > > That time would not reflect disk time used. It will be following.
> > >
> > > (time spent waiting in CFQ queues) + (time spent in dispatch queue) +
> > > (time spent in disk)
> >
> > In the case where multiple IO requests are issued from IO controller,
> > that time measurement is the time from when the first IO request is
> > issued until when the endio is called for the last IO request. Does
> > not it reflect disk time?
> >
>
> Not accurately as it will be including the time spent in CFQ queues as
> well as dispatch queue. I will not worry much about dispatch queue time
> but time spent CFQ queues can be significant.
>
> This is assuming that you are using token based scheme and will be
> dispatching requests from multiple groups at the same time.
>

Thinking more about it...

Does time based fairness make sense at higher level logical devices?

- Time based fairness generally helps with rotational devices which have
  high seek costs. At a higher level we don't even know the nature of the
  underlying device where the IO will ultimately go.

- For time based fairness to work accurately at a higher level, it will
  most likely require dispatching from a single group at a time, waiting
  for the requests from that group to complete, and then dispatching from
  the next. Something like the CFQ model of queues.

  Dispatching from a single queue/group works well in the case of a single
  underlying device where CFQ is operating, but at higher level devices,
  where typically there will be multiple physical devices underneath, it
  might not make sense, as it makes things more linear and reduces
  parallel processing further. So dispatching from a single group at a
  time and waiting before we dispatch from the next group will most likely
  be a killer for throughput on higher level devices and might not make
  sense.

  If we don't adopt the policy of dispatching from a single group, then we
  run into all the issues of weak isolation between groups, higher
  latencies, preemptions across groups etc.

The more I think about the whole issue and the desired set of
requirements, the more I am convinced that we probably need two IO
controlling mechanisms. One which focuses purely on providing bandwidth
fairness numbers on high level devices, and another which works at low
level devices with CFQ and provides good bandwidth shaping and strong
isolation, preserves fairness within a group and gives good control over
latencies.

The higher level controller will not worry about time based policies. It
can implement max BW and proportional BW control based on size of IO and
number of IO.

The lower level controller at the CFQ level will implement time based
group scheduling. Keeping it at the low level will have the advantage of
better utilization of hardware in various dm/md configurations (as no
throttling takes place at the higher level), but at the cost of not so
strict fairness numbers at the higher level. So those who want strict
fairness number policies at higher level devices, irrespective of the
shortcomings, can use the higher level controller. Others can stick to
the lower level controller.

For buffered write control we anyway have to either do something in the
memory controller or come up with another cgroup controller which
throttles IO before it goes into the cache. Or, in fact, we can have
another look at Andrea Righi's controller, which provided max BW and
throttled buffered writes before they got into the page cache, and try to
provide proportional BW there also.

Basically I see the space for two IO controllers. At the moment I can't
think of a way of coming up with a single controller which satisfies all
the requirements. So instead provide two and let the user choose one
based on his need.

Any thoughts?

Before finishing this mail, will throw a whacky idea in the ring. I was
going through the request based dm-multipath paper. Will it make sense
to implement request based dm-ioband? So basically we implement all the
group scheduling in CFQ and let dm-ioband implement a request function
to take the request and break it back into bios. This way we can keep
all the group control at one place and also meet most of the requirements.

So request based dm-ioband will have a request in hand once that request
has passed group control and prio control. Because dm-ioband is a device
mapper target, one can put it on higher level devices (practically taking
CFQ at higher level device), and provide fairness there. One can also
put it on those SSDs which don't use IO scheduler (this is kind of forcing
them to use the IO scheduler.)

I am sure there will be many issues, but one big issue I can think of is
that CFQ thinks there is one device beneath it and dispatches requests
from one queue (in case of idling), and that would kill parallelism at
the higher layer; throughput will suffer on many of the dm/md
configurations.

Thanks
Vivek

> But if you figure out a way that you dispatch requests from one group only
> at one time and wait for all requests to finish and then let next group
> go, then above can work fairly accurately. In that case it will become
> like CFQ with the only difference that effectively we have one queue per
> group instead of per process.
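The accounting problem quoted in this message (four requests from one group dispatched at t0..t3 and completed at t4..t7) can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function names are made up for this example and t is taken as 1. It contrasts adding up per-request service times with measuring the wall-clock time during which the group had at least one request in flight, and shows that the two agree once requests are dispatched one at a time.

```python
def summed_request_time(dispatch, complete):
    """Add up (completion - dispatch) for every request of the group."""
    return sum(c - d for d, c in zip(dispatch, complete))


def in_flight_time(dispatch, complete):
    """Wall-clock time during which at least one request is in flight."""
    intervals = sorted(zip(dispatch, complete))
    if not intervals:
        return 0
    total = 0
    cur_start, cur_end = intervals[0]
    for start, end in intervals[1:]:
        if start > cur_end:
            # Gap with nothing in flight: close the current busy period.
            total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent request: extend the busy period.
            cur_end = max(cur_end, end)
    return total + (cur_end - cur_start)


# rq1..rq4 dispatched at t0..t3 and completed at t4..t7, with t = 1.
overlapped_dispatch = [0, 1, 2, 3]
overlapped_complete = [4, 5, 6, 7]
print(summed_request_time(overlapped_dispatch, overlapped_complete))  # 16
print(in_flight_time(overlapped_dispatch, overlapped_complete))       # 7

# Same four requests dispatched strictly one at a time (queue depth 1).
serial_dispatch = [0, 1, 2, 3]
serial_complete = [1, 2, 3, 4]
print(summed_request_time(serial_dispatch, serial_complete))  # 4
print(in_flight_time(serial_dispatch, serial_complete))       # 4
```

Note that even the in-flight measure is not pure disk time when several groups dispatch at once, since, as pointed out in the thread, it still includes time spent waiting in the CFQ and dispatch queues.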
> > > > > > Secondly if a different group is running only single sequential reader, > > > > > there CFQ will be driving queue depth of 1 and time will not be running > > > > > faster and this inaccuracy in accounting will lead to unfair share between > > > > > groups. > > > > > > > > > > So we need something better to get a sense which group used how much of > > > > > disk time. > > > > > > > > It could be solved by implementing the way to pass on such information > > > > from IO scheduler to higher layer controller. > > > > > > > > > > How would you do that? Can you give some details exactly how and what > > > information IO scheduler will pass to higher level IO controller so that IO > > > controller can attribute right time to the group. > > > > If you would like to know when the idle timer is expired, how about > > adding a function to IO controller to be notified it from IO > > scheduler? IO scheduler calls the function when the timer is expired. > > > > This probably can be done. So this is like syncing between lower layers > and higher layers about when do we start idling and when do we stop it and > both the layers should be in sync. > > This is something my common layer approach does. Becuase it is so close to > IO scheuler, I can do it relatively easily. > > One probably can create interfaces to even propogate this information up. > But this all will probably come into the picture only if we don't use > token based schemes and come up with something where at one point of time > dispatch are from one group only. > > > > > > > How about making throttling policy be user selectable like the IO > > > > > > scheduler and putting it in the higher layer? So we could support > > > > > > all of policies (time-based, size-based and rate limiting). There > > > > > > seems not to only one solution which satisfies all users. But I agree > > > > > > with starting with proportional bandwidth control first. 
> > > > > > > > > > > > > > > > What are the cases where time based policy does not work and size based > > > > > policy works better and user would choose size based policy and not timed > > > > > based one? > > > > > > > > I think that disk time is not simply proportional to IO size. If there > > > > are two groups whose wights are equally assigned and they issue > > > > different sized IOs repsectively, the bandwidth of each group would > > > > not distributed equally as expected. > > > > > > > > > > If we are providing fairness in terms of time, it is fair. If we provide > > > equal time slots to two processes and if one got more IO done because it > > > was not wasting time seeking or it issued bigger size IO, it deserves that > > > higher BW. IO controller will make sure that process gets fair share in > > > terms of time and exactly how much BW one got will depend on the workload. > > > > > > That's the precise reason that fairness in terms of time is better on > > > seeky media. > > > > If the seek time is negligible, the bandwidth would not be distributed > > according to a proportion of weight settings. I think that it would be > > unclear for users to understand how bandwidth is distributed. And I > > also think that seeky media would gradually become obsolete, > > > > I can understand that if lesser the seek cost game starts changing and > probably a size based policy also work decently. > > In that case at some point of time probably CFQ will also need to support > another mode/policy where fairness is provided in terms of size of IO, if > it detects a SSD with hardware queuing. Currently it seem to be disabling > the idling in that case. But this is not very good from fairness point of > view. I guess if CFQ wants to provide fairness in such cases, it needs to > dynamically change the shape and start thinking in terms of size of IO. > > So far my testing has been very limited to hard disks connected to my > computer. 
I will do some testing on high end enterprise storage and see > how much do seek matter and how well both the implementations work. > > > > > > I am not against implementing things in higher layer as long as we can > > > > > ensure tight control on latencies, strong isolation between groups and > > > > > not break CFQ's class and ioprio model with-in group. > > > > > > > > > > > BTW, I will start to reimplement dm-ioband into block layer. > > > > > > > > > > Can you elaborate little bit on this? > > > > > > > > bio is grabbed in generic_make_request() and throttled as well as > > > > dm-ioband's mechanism. dmsetup command is not necessary any longer. > > > > > > > > > > Ok, so one would not need dm-ioband device now, but same dm-ioband > > > throttling policies will apply. So until and unless we figure out a > > > better way, the issues I have pointed out will still exists even in > > > new implementation. > > > > Yes, those still exist, but somehow I would like to try to solve them. > > > > > > The default value of io_limit on the previous test was 128 (not 192) > > > > which is equall to the default value of nr_request. > > > > > > Hm..., I used following commands to create two ioband devices. > > > > > > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none" > > > "weight 0 :100" | dmsetup create ioband1 > > > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none" > > > "weight 0 :100" | dmsetup create ioband2 > > > > > > Here io_limit value is zero so it should pick default value. Following is > > > output of "dmsetup table" command. > > > > > > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100 > > > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100 > > > ^^^^ > > > IIUC, above number 192 is reflecting io_limit? If yes, then default seems > > > to be 192? > > > > The default vaule has changed since v1.12.0 and increased from 128 to 192. > > > > > > > I set it up to 256 as you suggested. 
I still see writer starving reader. I > > > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered > > > > > writes. > > > > > > > > O.K. You removed "conv=fdatasync", the new dm-ioband handles > > > > sync/async requests separately, and it solves this > > > > buffered-write-starves-read problem. I would like to post it soon > > > > after doing some more test. > > > > > > > > > On top of that can you please give some details how increasing the > > > > > buffered queue length reduces the impact of writers? > > > > > > > > When the number of in-flight IOs exceeds io_limit, processes which are > > > > going to issue IOs are made sleep by dm-ioband until all the in-flight > > > > IOs are finished. But IO scheduler layer can accept IO requests more > > > > than the value of io_limit, so it was a bottleneck of the throughput. > > > > > > > > > > Ok, so it should have been throughput bottleneck but how did it solve the > > > issue of writer starving the reader as you had mentioned in the mail. > > > > As wrote above, I modified dm-ioband to handle sync/async requests > > separately, so even if writers do a lot of buffered IOs, readers can > > issue IOs regardless writers' busyness. Once the IOs are backlogged > > for throttling, the both sync and async requests are issued according > > to the other of arrival. > > > > Ok, so if both the readers and writers are buffered and some tokens become > available then these tokens will be divided half and half between readers > or writer queues? > > > > Secondly, you mentioned that processes are made to sleep once we cross > > > io_limit. This sounds like request descriptor facility on requeust queue > > > where processes are made to sleep. > > > > > > There are threads in kernel which don't want to sleep while submitting > > > bios. 
For example, btrfs has bio submitting thread which does not want > > > to sleep hence it checks with device if it is congested or not and not > > > submit the bio if it is congested. How would you handle such cases. Have > > > you implemented any per group congestion kind of interface to make sure > > > such IO's don't sleep if group is congested. > > > > > > Or this limit is per ioband device which every group on the device is > > > sharing. If yes, then how would you provide isolation between groups > > > because if one groups consumes io_limit tokens, then other will simply > > > be serialized on that device? > > > > There are two kind of limit and both limit the number of IO requests > > which can be issued simultaneously, but one is for per ioband device, > > the other is for per ioband group. The per group limit assigned to > > each group is calculated by dividing io_limit according to their > > proportion of weight. > > > > The kernel thread is not made to sleep by the per group limit, because > > several kinds of kernel threads submit IOs from multiple groups and > > for multiple devices in a single thread. At this time, the kernel > > thread is made to sleep by the per device limit only. > > > > Interesting. Actually not blocking kernel threads on per group limit > and instead blocking it only on per device limts sounds like a good idea. > > I can also do something similar and that will take away the need of > exporting per group congestion interface to higher layers and reduce > complexity. If some kernel thread does not want to block, these will > continue to use existing per device/bdi congestion interface. > > Thanks > Vivek ^ permalink raw reply [flat|nested] 349+ messages in thread
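The two limits discussed in this message (a per ioband device limit and a per ioband group limit derived from io_limit by weight) can be modelled in a few lines. This is only a userspace sketch under stated assumptions: the rounding (integer division with a floor of one slot per group), the use of ">=" as the threshold test, and the names per_group_limits and must_sleep are all hypothetical and are not taken from dm-ioband.

```python
def per_group_limits(io_limit, weights):
    """Split a device-wide in-flight limit across groups by weight.

    Assumption: integer division, and every group keeps at least one slot.
    """
    total = sum(weights.values())
    return {g: max(1, io_limit * w // total) for g, w in weights.items()}


def must_sleep(is_kernel_thread, group_inflight, group_limit,
               device_inflight, device_limit):
    """Return True if the submitter should block before issuing more IO."""
    if device_inflight >= device_limit:
        return True          # the per-device limit applies to everyone
    if is_kernel_thread:
        return False         # kernel threads are exempt from the per-group limit
    return group_inflight >= group_limit


# Hypothetical weights; 192 is the default io_limit mentioned in the thread.
limits = per_group_limits(192, {"grp1": 100, "grp2": 300})
```

With these made-up weights, grp1 receives a quarter of the 192 slots and grp2 the rest. A kernel thread submitting on behalf of grp1 is held back only when the device as a whole is at its limit, which matches the behaviour described above.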
* Re: IO scheduler based IO controller V10 [not found] ` <20090930110500.GA26631-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-10-01 6:41 ` Ryo Tsuruta 0 siblings, 0 replies; 349+ messages in thread From: Ryo Tsuruta @ 2009-10-01 6:41 UTC (permalink / raw) To: vgoyal-H+wXaHxf7aLQT0dZR+AlfA Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Hi Vivek, Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote: > > Hi Vivek, > > > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > > I was thinking that elevator layer will do the merge of bios. So IO > > > scheduler/elevator can time stamp the first bio in the request as it goes > > > into the disk and again timestamp with finish time once request finishes. > > > > > > This way higher layer can get an idea how much disk time a group of bios > > > used. But on multi queue, if we dispatch say 4 requests from same queue, > > > then time accounting becomes an issue. > > > > > > Consider following where four requests rq1, rq2, rq3 and rq4 are > > > dispatched to disk at time t0, t1, t2 and t3 respectively and these > > > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume > > > time elapsed between each of milestones is t. Also assume that all these > > > requests are from same queue/group. 
> > > > > > t0 t1 t2 t3 t4 t5 t6 t7 > > > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4 > > > > > > Now higher layer will think that time consumed by group is: > > > > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t > > > > > > But the time elapsed is only 7t. > > > > IO controller can know how many requests are issued and still in > > progress. Is it not enough to accumulate the time while in-flight IOs > > exist? > > > > That time would not reflect disk time used. It will be follwoing. > > (time spent waiting in CFQ queues) + (time spent in dispatch queue) + > (time spent in disk) In the case where multiple IO requests are issued from IO controller, that time measurement is the time from when the first IO request is issued until when the endio is called for the last IO request. Does not it reflect disk time? > > > Secondly if a different group is running only single sequential reader, > > > there CFQ will be driving queue depth of 1 and time will not be running > > > faster and this inaccuracy in accounting will lead to unfair share between > > > groups. > > > > > > So we need something better to get a sense which group used how much of > > > disk time. > > > > It could be solved by implementing the way to pass on such information > > from IO scheduler to higher layer controller. > > > > How would you do that? Can you give some details exactly how and what > information IO scheduler will pass to higher level IO controller so that IO > controller can attribute right time to the group. If you would like to know when the idle timer is expired, how about adding a function to IO controller to be notified it from IO scheduler? IO scheduler calls the function when the timer is expired. > > > > How about making throttling policy be user selectable like the IO > > > > scheduler and putting it in the higher layer? So we could support > > > > all of policies (time-based, size-based and rate limiting). There > > > > seems not to only one solution which satisfies all users. 
But I agree > > > > with starting with proportional bandwidth control first. > > > > > > > > > > What are the cases where time based policy does not work and size based > > > policy works better and user would choose size based policy and not timed > > > based one? > > > > I think that disk time is not simply proportional to IO size. If there > > are two groups whose wights are equally assigned and they issue > > different sized IOs repsectively, the bandwidth of each group would > > not distributed equally as expected. > > > > If we are providing fairness in terms of time, it is fair. If we provide > equal time slots to two processes and if one got more IO done because it > was not wasting time seeking or it issued bigger size IO, it deserves that > higher BW. IO controller will make sure that process gets fair share in > terms of time and exactly how much BW one got will depend on the workload. > > That's the precise reason that fairness in terms of time is better on > seeky media. If the seek time is negligible, the bandwidth would not be distributed according to a proportion of weight settings. I think that it would be unclear for users to understand how bandwidth is distributed. And I also think that seeky media would gradually become obsolete, > > > I am not against implementing things in higher layer as long as we can > > > ensure tight control on latencies, strong isolation between groups and > > > not break CFQ's class and ioprio model with-in group. > > > > > > > BTW, I will start to reimplement dm-ioband into block layer. > > > > > > Can you elaborate little bit on this? > > > > bio is grabbed in generic_make_request() and throttled as well as > > dm-ioband's mechanism. dmsetup command is not necessary any longer. > > > > Ok, so one would not need dm-ioband device now, but same dm-ioband > throttling policies will apply. So until and unless we figure out a > better way, the issues I have pointed out will still exists even in > new implementation. 
Yes, those still exist, but somehow I would like to try to solve them. > > The default value of io_limit on the previous test was 128 (not 192) > > which is equall to the default value of nr_request. > > Hm..., I used following commands to create two ioband devices. > > echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none" > "weight 0 :100" | dmsetup create ioband1 > echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none" > "weight 0 :100" | dmsetup create ioband2 > > Here io_limit value is zero so it should pick default value. Following is > output of "dmsetup table" command. > > ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100 > ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100 > ^^^^ > IIUC, above number 192 is reflecting io_limit? If yes, then default seems > to be 192? The default vaule has changed since v1.12.0 and increased from 128 to 192. > > > I set it up to 256 as you suggested. I still see writer starving reader. I > > > have removed "conv=fdatasync" from writer so that a writer is pure buffered > > > writes. > > > > O.K. You removed "conv=fdatasync", the new dm-ioband handles > > sync/async requests separately, and it solves this > > buffered-write-starves-read problem. I would like to post it soon > > after doing some more test. > > > > > On top of that can you please give some details how increasing the > > > buffered queue length reduces the impact of writers? > > > > When the number of in-flight IOs exceeds io_limit, processes which are > > going to issue IOs are made sleep by dm-ioband until all the in-flight > > IOs are finished. But IO scheduler layer can accept IO requests more > > than the value of io_limit, so it was a bottleneck of the throughput. > > > > Ok, so it should have been throughput bottleneck but how did it solve the > issue of writer starving the reader as you had mentioned in the mail. 
As I wrote above, I modified dm-ioband to handle sync/async requests separately, so even if writers do a lot of buffered IOs, readers can issue IOs regardless of the writers' busyness. Once the IOs are backlogged for throttling, both sync and async requests are issued according to the order of arrival.

> Secondly, you mentioned that processes are made to sleep once we cross
> io_limit. This sounds like request descriptor facility on request queue
> where processes are made to sleep.
>
> There are threads in kernel which don't want to sleep while submitting
> bios. For example, btrfs has bio submitting thread which does not want
> to sleep hence it checks with device if it is congested or not and not
> submit the bio if it is congested. How would you handle such cases. Have
> you implemented any per group congestion kind of interface to make sure
> such IO's don't sleep if group is congested.
>
> Or this limit is per ioband device which every group on the device is
> sharing. If yes, then how would you provide isolation between groups
> because if one groups consumes io_limit tokens, then other will simply
> be serialized on that device?

There are two kinds of limit, and both limit the number of IO requests which can be issued simultaneously: one is per ioband device, the other is per ioband group. The per group limit assigned to each group is calculated by dividing io_limit according to the groups' proportion of weight.

The kernel thread is not made to sleep by the per group limit, because several kinds of kernel threads submit IOs from multiple groups and for multiple devices in a single thread. At this time, the kernel thread is made to sleep by the per device limit only.

Thanks,
Ryo Tsuruta
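The rq1..rq4 example from this message can be worked through numerically. The sketch below compares summing per-request latencies (which gives 16t) with accumulating time only while at least one request is in flight (which gives 7t). It is a userspace model with hypothetical names (summed_latency, busy_time), taking t = 1. Note that, as pointed out earlier in the thread, even the second figure would still include time spent waiting in CFQ and dispatch queues if measured above the IO scheduler, so it is not a pure disk-time measurement.

```python
def summed_latency(requests):
    """Sum of (completion - dispatch) over all requests."""
    return sum(done - start for start, done in requests)


def busy_time(requests):
    """Wall-clock time during which at least one request was in flight."""
    total = 0
    cur_start = cur_end = None
    for start, done in sorted(requests):
        if cur_end is None or start > cur_end:
            # A gap with nothing in flight: close the previous busy interval.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, done
        else:
            # Overlaps the current busy interval: extend it.
            cur_end = max(cur_end, done)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


t = 1
# rq1..rq4 dispatched at t0..t3 and completed at t4..t7, as in the example.
requests = [(0 * t, 4 * t), (1 * t, 5 * t), (2 * t, 6 * t), (3 * t, 7 * t)]
```

Here summed_latency(requests) charges the group four overlapping intervals of 4t each, while busy_time(requests) counts the single interval from t0 to t7 once.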
* Re: IO scheduler based IO controller V10 [not found] ` <20090930.174319.183036386.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> @ 2009-09-30 11:05 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-30 11:05 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Wed, Sep 30, 2009 at 05:43:19PM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > I was thinking that elevator layer will do the merge of bios. So IO > > scheduler/elevator can time stamp the first bio in the request as it goes > > into the disk and again timestamp with finish time once request finishes. > > > > This way higher layer can get an idea how much disk time a group of bios > > used. But on multi queue, if we dispatch say 4 requests from same queue, > > then time accounting becomes an issue. > > > > Consider following where four requests rq1, rq2, rq3 and rq4 are > > dispatched to disk at time t0, t1, t2 and t3 respectively and these > > requests finish at time t4, t5, t6 and t7. For sake of simlicity assume > > time elapsed between each of milestones is t. Also assume that all these > > requests are from same queue/group. 
> > > > t0 t1 t2 t3 t4 t5 t6 t7 > > rq1 rq2 rq3 rq4 rq1 rq2 rq3 rq4 > > > > Now higher layer will think that time consumed by group is: > > > > (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t > > > > But the time elapsed is only 7t. > > IO controller can know how many requests are issued and still in > progress. Is it not enough to accumulate the time while in-flight IOs > exist? > That time would not reflect disk time used. It will be follwoing. (time spent waiting in CFQ queues) + (time spent in dispatch queue) + (time spent in disk) > > Secondly if a different group is running only single sequential reader, > > there CFQ will be driving queue depth of 1 and time will not be running > > faster and this inaccuracy in accounting will lead to unfair share between > > groups. > > > > So we need something better to get a sense which group used how much of > > disk time. > > It could be solved by implementing the way to pass on such information > from IO scheduler to higher layer controller. > How would you do that? Can you give some details exactly how and what information IO scheduler will pass to higher level IO controller so that IO controller can attribute right time to the group. > > > How about making throttling policy be user selectable like the IO > > > scheduler and putting it in the higher layer? So we could support > > > all of policies (time-based, size-based and rate limiting). There > > > seems not to only one solution which satisfies all users. But I agree > > > with starting with proportional bandwidth control first. > > > > > > > What are the cases where time based policy does not work and size based > > policy works better and user would choose size based policy and not timed > > based one? > > I think that disk time is not simply proportional to IO size. If there > are two groups whose wights are equally assigned and they issue > different sized IOs repsectively, the bandwidth of each group would > not distributed equally as expected. 
> If we are providing fairness in terms of time, it is fair. If we provide equal time slots to two processes and if one got more IO done because it was not wasting time seeking or it issued bigger size IO, it deserves that higher BW. IO controller will make sure that process gets fair share in terms of time and exactly how much BW one got will depend on the workload. That's the precise reason that fairness in terms of time is better on seeky media. > > I am not against implementing things in higher layer as long as we can > > ensure tight control on latencies, strong isolation between groups and > > not break CFQ's class and ioprio model with-in group. > > > > > BTW, I will start to reimplement dm-ioband into block layer. > > > > Can you elaborate little bit on this? > > bio is grabbed in generic_make_request() and throttled as well as > dm-ioband's mechanism. dmsetup command is not necessary any longer. > Ok, so one would not need dm-ioband device now, but same dm-ioband throttling policies will apply. So until and unless we figure out a better way, the issues I have pointed out will still exists even in new implementation. > > > > Fairness for higher level logical devices > > > > ========================================= > > > > Do we want good fairness numbers for higher level logical devices also > > > > or it is sufficient to provide fairness at leaf nodes. Providing fairness > > > > at leaf nodes can help us use the resources optimally and in the process > > > > we can get fairness at higher level also in many of the cases. > > > > > > We should also take care of block devices which provide their own > > > make_request_fn() and not use a IO scheduler. We can't use the leaf > > > nodes approach to such devices. > > > > > > > I am not sure how big an issue is this. This can be easily solved by > > making use of NOOP scheduler by these devices. What are the reasons for > > these devices to not use even noop? 
> > I'm not sure why the developers of the device driver choose their own > way, and the driver is provided in binary form, so we can't modify it. > > > > > Fairness with-in group > > > > ====================== > > > > One of the issues with higher level controller is that how to do fair > > > > throttling so that fairness with-in group is not impacted. Especially > > > > the case of making sure that we don't break the notion of ioprio of the > > > > processes with-in group. > > > > > > I ran your test script to confirm that the notion of ioprio was not > > > broken by dm-ioband. Here is the results of the test. > > > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html > > > > > > I think that the time period during which dm-ioband holds IO requests > > > for throttling would be too short to break the notion of ioprio. > > > > Ok, I re-ran that test. Previously default io_limit value was 192 and now > > The default value of io_limit on the previous test was 128 (not 192) > which is equall to the default value of nr_request. Hm..., I used following commands to create two ioband devices. echo "0 $(blockdev --getsize /dev/sdb2) ioband /dev/sdb2 1 0 0 none" "weight 0 :100" | dmsetup create ioband1 echo "0 $(blockdev --getsize /dev/sdb3) ioband /dev/sdb3 1 0 0 none" "weight 0 :100" | dmsetup create ioband2 Here io_limit value is zero so it should pick default value. Following is output of "dmsetup table" command. ioband2: 0 89899740 ioband 8:19 1 4 192 none weight 768 :100 ioband1: 0 41961780 ioband 8:18 1 4 192 none weight 768 :100 ^^^^ IIUC, above number 192 is reflecting io_limit? If yes, then default seems to be 192? > > > I set it up to 256 as you suggested. I still see writer starving reader. I > > have removed "conv=fdatasync" from writer so that a writer is pure buffered > > writes. > > O.K. You removed "conv=fdatasync", the new dm-ioband handles > sync/async requests separately, and it solves this > buffered-write-starves-read problem. 
I would like to post it soon > after doing some more test. > > > On top of that can you please give some details how increasing the > > buffered queue length reduces the impact of writers? > > When the number of in-flight IOs exceeds io_limit, processes which are > going to issue IOs are made sleep by dm-ioband until all the in-flight > IOs are finished. But IO scheduler layer can accept IO requests more > than the value of io_limit, so it was a bottleneck of the throughput. > Ok, so it should have been throughput bottleneck but how did it solve the issue of writer starving the reader as you had mentioned in the mail. Secondly, you mentioned that processes are made to sleep once we cross io_limit. This sounds like request descriptor facility on requeust queue where processes are made to sleep. There are threads in kernel which don't want to sleep while submitting bios. For example, btrfs has bio submitting thread which does not want to sleep hence it checks with device if it is congested or not and not submit the bio if it is congested. How would you handle such cases. Have you implemented any per group congestion kind of interface to make sure such IO's don't sleep if group is congested. Or this limit is per ioband device which every group on the device is sharing. If yes, then how would you provide isolation between groups because if one groups consumes io_limit tokens, then other will simply be serialized on that device? > > IO Prio issue > > -------------- > > I ran another test where two ioband devices were created of weight 100 > > each on two partitions. In first group 4 readers were launched. Three > > readers are of class BE and prio 7, fourth one is of class BE prio 0. In > > group2, I launched a buffered writer. > > > > One would expect that prio0 reader gets more bandwidth as compared to > > prio 4 readers and prio 7 readers will get more or less same bw. Looks like > > that is not happening. 
Look how vanilla CFQ provides much more bandwidth
> > to prio0 reader as compared to prio7 reader and how putting them in the
> > group reduces the difference between prio0 and prio7 readers.
> >
> > Following are the results.
>
> O.K. I'll try to do more test with dm-ioband according to your
> comments especially working with CFQ. Thanks for pointing out.
>
> Thanks,
> Ryo Tsuruta
* Re: IO scheduler based IO controller V10 [not found] ` <20090929.185653.183056711.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> @ 2009-09-29 10:49 ` Takuya Yoshikawa 2009-09-29 14:10 ` Vivek Goyal 2009-09-30 3:11 ` Vivek Goyal 2 siblings, 0 replies; 349+ messages in thread From: Takuya Yoshikawa @ 2009-09-29 10:49 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Hi, Ryo Tsuruta wrote: > Hi Vivek and all, > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: >> On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: > >>> We are starting from a point where there is no cgroup based IO >>> scheduling in the kernel. And it is probably not reasonable to satisfy >>> all IO scheduling related requirements in one patch set. We can start >>> with something simple, and build on top of that. So a very simple >>> patch set that enables cgroup based proportional scheduling for CFQ >>> seems like the way to go at this point. >> Sure, we can start with CFQ only. But a bigger question we need to answer >> is that is CFQ the right place to solve the issue? Jens, do you think >> that CFQ is the right place to solve the problem? >> >> Andrew seems to favor a high level approach so that IO schedulers are less >> complex and we can provide fairness at high level logical devices also. 
>
> I'm not in favor of expansion of CFQ, because some enterprise storages
> are better performed with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage system. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
>

I would like to know a bit more about the specifics of Nauman's scheduler design. Nauman said "cgroup based proportional scheduling for CFQ" and that we need not expand much of CFQ itself. Is that right, Nauman?

If so, we can reuse the io controller for new schedulers similar to CFQ.

I do not know how important it is to consider which scheduler is the current favorite for enterprise storage. If we introduce an io controller, the io pattern to the disks will change, and in that case there is no guarantee that NOOP with some io controller will work better than CFQ with some io controller.

Of course an io controller for NOOP may be better.

Thanks,
Takuya Yoshikawa

>
>> I will again try to summarize my understanding so far about the pros/cons
>> of each approach and then we can take the discussion forward.
>
> Good summary. Thanks for your work.
>
>> Fairness in terms of size of IO or disk time used
>> =================================================
>> On a seeky media, fairness in terms of disk time can get us better results
>> instead of fairness in terms of size of IO or number of IO.
>>
>> If we implement some kind of time based solution at higher layer, then
>> that higher layer should know how much time each group used. We
>> can probably do some kind of timestamping in bio to get a sense when did it
>> get into disk and when did it finish. But on a multi queue hardware there
>> can be multiple requests in the disk either from same queue or from different
>> queues and with pure timestamping based approach, so far I could not think
>> how at high level we will get an idea who used how much of time.
> > IIUC, could the overlap time be calculated from time-stamp on a multi > queue hardware? > >> So this is the first point of contention that how do we want to provide >> fairness. In terms of disk time used or in terms of size of IO/number of >> IO. >> >> Max bandwidth Controller or Proportional bandwidth controller >> ============================================================= >> What is our primary requirement here? A weight based proportional >> bandwidth controller where we can use the resources optimally and any >> kind of throttling kicks in only if there is contention for the disk. >> >> Or we want max bandwidth control where a group is not allowed to use the >> disk even if disk is free. >> >> Or we need both? I would think that at some point of time we will need >> both but we can start with proportional bandwidth control first. > > How about making throttling policy be user selectable like the IO > scheduler and putting it in the higher layer? So we could support > all of policies (time-based, size-based and rate limiting). There > seems not to only one solution which satisfies all users. But I agree > with starting with proportional bandwidth control first. > > BTW, I will start to reimplement dm-ioband into block layer. > >> Fairness for higher level logical devices >> ========================================= >> Do we want good fairness numbers for higher level logical devices also >> or it is sufficient to provide fairness at leaf nodes. Providing fairness >> at leaf nodes can help us use the resources optimally and in the process >> we can get fairness at higher level also in many of the cases. > > We should also take care of block devices which provide their own > make_request_fn() and not use a IO scheduler. We can't use the leaf > nodes approach to such devices. > >> But do we want strict fairness numbers on higher level logical devices >> even if it means sub-optimal usage of unerlying phsical devices? 
>> >> I think that for proportinal bandwidth control, it should be ok to provide >> fairness at higher level logical device but for max bandwidth control it >> might make more sense to provide fairness at higher level. Consider a >> case where from a striped device a customer wants to limit a group to >> 30MB/s and in case of leaf node control, if every leaf node provides >> 30MB/s, it might accumulate to much more than specified rate at logical >> device. >> >> Latency Control and strong isolation between groups >> =================================================== >> Do we want a good isolation between groups and better latencies and >> stronger isolation between groups? >> >> I think if problem is solved at IO scheduler level, we can achieve better >> latency control and hence stronger isolation between groups. >> >> Higher level solutions should find it hard to provide same kind of latency >> control and isolation between groups as IO scheduler based solution. > > Why do you think that the higher level solution is hard to provide it? > I think that it is a matter of how to implement throttling policy. > >> Fairness for buffered writes >> ============================ >> Doing io control at any place below page cache has disadvantage that page >> cache might not dispatch more writes from higher weight group hence higher >> weight group might not see more IO done. Andrew says that we don't have >> a solution to this problem in kernel and he would like to see it handled >> properly. >> >> Only way to solve this seems to be to slow down the writers before they >> write into page cache. IO throttling patch handled it by slowing down >> writer if it crossed max specified rate. Other suggestions have come in >> the form of dirty_ratio per memory cgroup or a separate cgroup controller >> al-together where some kind of per group write limit can be specified. 
>> >> So if solution is implemented at IO scheduler layer or at device mapper >> layer, both shall have to rely on another controller to be co-mounted >> to handle buffered writes properly. >> >> Fairness with-in group >> ====================== >> One of the issues with higher level controller is that how to do fair >> throttling so that fairness with-in group is not impacted. Especially >> the case of making sure that we don't break the notion of ioprio of the >> processes with-in group. > > I ran your test script to confirm that the notion of ioprio was not > broken by dm-ioband. Here is the results of the test. > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html > > I think that the time period during which dm-ioband holds IO requests > for throttling would be too short to break the notion of ioprio. > >> Especially io throttling patch was very bad in terms of prio with-in >> group where throttling treated everyone equally and difference between >> process prio disappeared. >> >> Reads Vs Writes >> =============== >> A higher level control most likely will change the ratio in which reads >> and writes are dispatched to disk with-in group. It used to be decided >> by IO scheduler so far but with higher level groups doing throttling and >> possibly buffering the bios and releasing them later, they will have to >> come up with their own policy on in what proportion reads and writes >> should be dispatched. In case of IO scheduler based control, all the >> queuing takes place at IO scheduler and it still retains control of >> in what ration reads and writes should be dispatched. > > I don't think it is a concern. The current implementation of dm-ioband > is that sync/async IO requests are handled separately and the > backlogged IOs are released according to the order of arrival if both > sync and async requests are backlogged. 
> >> Summary >> ======= >> >> - An io scheduler based io controller can provide better latencies, >> stronger isolation between groups, time based fairness and will not >> interfere with io schedulers policies like class, ioprio and >> reader vs writer issues. >> >> But it can gunrantee fairness at higher logical level devices. >> Especially in case of max bw control, leaf node control does not sound >> to be the most appropriate thing. >> >> - IO throttling provides max bw control in terms of absolute rate. It has >> the advantage that it can provide control at higher level logical device >> and also control buffered writes without need of additional controller >> co-mounted. >> >> But it does only max bw control and not proportion control so one might >> not be using resources optimally. It looses sense of task prio and class >> with-in group as any of the task can be throttled with-in group. Because >> throttling does not kick in till you hit the max bw limit, it should find >> it hard to provide same latencies as io scheduler based control. >> >> - dm-ioband also has the advantage that it can provide fairness at higher >> level logical devices. >> >> But, fairness is provided only in terms of size of IO or number of IO. >> No time based fairness. It is very throughput oriented and does not >> throttle high speed group if other group is running slow random reader. >> This results in bad latnecies for random reader group and weaker >> isolation between groups. > > A new policy can be added to dm-ioband. Actually, range-bw policy, > which provides min and max bandwidth control, does time-based > throttling. Moreover there is room for improvement for existing > policies. The write-starve-read issue you pointed out will be solved > soon. > >> Also it does not provide fairness if a group is not continuously >> backlogged. 
So if one is running 1-2 dd/sequential readers in the group, >> one does not get fairness until workload is increased to a point where >> group becomes continuously backlogged. This also results in poor >> latencies and limited fairness. > > This is intended to efficiently use bandwidth of underlying devices > when IO load is low. > >> At this point of time it does not look like a single IO controller all >> the scenarios/requirements. This means few things to me. >> >> - Drop some of the requirements and go with one implementation which meets >> those reduced set of requirements. >> >> - Have more than one IO controller implementation in kenrel. One for lower >> level control for better latencies, stronger isolation and optimal resource >> usage and other one for fairness at higher level logical devices and max >> bandwidth control. >> >> And let user decide which one to use based on his/her needs. >> >> - Come up with more intelligent way of doing IO control where single >> controller covers all the cases. >> >> At this point of time, I am more inclined towards option 2 of having more >> than one implementation in kernel. :-) (Until and unless we can brainstrom >> and come up with ideas to make option 3 happen). >> >>> It would be great if we discuss our plans on the mailing list, so we >>> can get early feedback from everyone. >> >> This is what comes to my mind so far. Please add to the list if I have missed >> some points. Also correct me if I am wrong about the pros/cons of the >> approaches. >> >> Thoughts/ideas/opinions are welcome... >> >> Thanks >> Vivek > > Thanks, > Ryo Tsuruta > ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <20090929.185653.183056711.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> 2009-09-29 10:49 ` Takuya Yoshikawa @ 2009-09-29 14:10 ` Vivek Goyal 2009-09-30 3:11 ` Vivek Goyal 2 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-29 14:10 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote: > Hi Vivek and all, > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: > > > > We are starting from a point where there is no cgroup based IO > > > scheduling in the kernel. And it is probably not reasonable to satisfy > > > all IO scheduling related requirements in one patch set. We can start > > > with something simple, and build on top of that. So a very simple > > > patch set that enables cgroup based proportional scheduling for CFQ > > > seems like the way to go at this point. > > > > Sure, we can start with CFQ only. But a bigger question we need to answer > > is that is CFQ the right place to solve the issue? Jens, do you think > > that CFQ is the right place to solve the problem? > > > > Andrew seems to favor a high level approach so that IO schedulers are less > > complex and we can provide fairness at high level logical devices also. 
>
> I'm not in favor of expansion of CFQ, because some enterprise storages
> are better performed with NOOP rather than CFQ, and I think bandwidth
> control is needed much more for such storage system. Is it easy to
> support other IO schedulers even if a new IO scheduler is introduced?
> I would like to know a bit more specific about Nauman's scheduler design.
>

The new design is essentially the old design, except that the suggestion is
to cover only CFQ in the first step, instead of all four IO schedulers, and
to cover the others later. So providing fairness for NOOP is not an issue.
Even if we introduce new IO schedulers down the line, I can't think of a
reason why we can't cover them too with the common layer.

> > I will again try to summarize my understanding so far about the pros/cons
> > of each approach and then we can take the discussion forward.
>
> Good summary. Thanks for your work.
>
> > Fairness in terms of size of IO or disk time used
> > =================================================
> > On a seeky media, fairness in terms of disk time can get us better results
> > instead fairness in terms of size of IO or number of IO.
> >
> > If we implement some kind of time based solution at higher layer, then
> > that higher layer should know who used how much of time each group used. We
> > can probably do some kind of timestamping in bio to get a sense when did it
> > get into disk and when did it finish. But on a multi queue hardware there
> > can be multiple requests in the disk either from same queue or from different
> > queues and with pure timestamping based approach, so far I could not think
> > how at high level we will get an idea who used how much of time.
>
> IIUC, could the overlap time be calculated from time-stamp on a multi
> queue hardware?

So far I could not think of anything clean. Do you have something in mind?

I was thinking that elevator layer will do the merge of bios.
So the IO scheduler/elevator can timestamp the first bio in the request as it
goes into the disk, and timestamp it again with the finish time once the
request completes. This way the higher layer can get an idea of how much disk
time a group of bios used. But on multi queue hardware, if we dispatch say 4
requests from the same queue, then time accounting becomes an issue.

Consider the following, where four requests rq1, rq2, rq3 and rq4 are
dispatched to the disk at times t0, t1, t2 and t3 respectively, and these
requests finish at times t4, t5, t6 and t7. For the sake of simplicity, assume
the time elapsed between consecutive milestones is t. Also assume that all
these requests are from the same queue/group.

        t0   t1   t2   t3   t4   t5   t6   t7
        rq1  rq2  rq3  rq4  rq1  rq2  rq3  rq4

Now the higher layer will think that the time consumed by the group is:

        (t4-t0) + (t5-t1) + (t6-t2) + (t7-t3) = 16t

But the time elapsed is only 7t.

Secondly, if a different group is running only a single sequential reader,
CFQ will be driving a queue depth of 1 there, so that group's time will not
appear to run faster in the same way, and this inaccuracy in accounting will
lead to an unfair share between groups.

So we need something better to get a sense of which group used how much disk
time.

>
> > So this is the first point of contention that how do we want to provide
> > fairness. In terms of disk time used or in terms of size of IO/number of
> > IO.
> >
> > Max bandwidth Controller or Proportional bandwidth controller
> > =============================================================
> > What is our primary requirement here? A weight based proportional
> > bandwidth controller where we can use the resources optimally and any
> > kind of throttling kicks in only if there is contention for the disk.
> >
> > Or we want max bandwidth control where a group is not allowed to use the
> > disk even if disk is free.
> >
> > Or we need both? I would think that at some point of time we will need
> > both but we can start with proportional bandwidth control first.
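
To make the accounting mismatch in the rq1..rq4 example earlier in this reply concrete, here is a minimal C sketch. It is only an illustration under stated assumptions: `struct rq_stamp`, `summed_time()` and `busy_time()` are hypothetical names that do not come from any of the patches discussed, times are in arbitrary units of t, and `busy_time()` assumes the requests are sorted by dispatch time.

```c
#include <stddef.h>

/* Hypothetical record of one request's dispatch and completion time. */
struct rq_stamp {
	unsigned long start;	/* time the request was dispatched to the disk */
	unsigned long finish;	/* time the request completed */
};

/*
 * Naive per-request accounting: charge the group (finish - start) for
 * every request, even when the requests overlapped inside the disk.
 */
unsigned long summed_time(const struct rq_stamp *rq, size_t n)
{
	unsigned long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += rq[i].finish - rq[i].start;
	return total;
}

/*
 * Elapsed busy time: length of the union of the [start, finish]
 * intervals. Assumes rq[] is sorted by start time.
 */
unsigned long busy_time(const struct rq_stamp *rq, size_t n)
{
	unsigned long total = 0, cur_start, cur_end;
	size_t i;

	if (n == 0)
		return 0;

	cur_start = rq[0].start;
	cur_end = rq[0].finish;
	for (i = 1; i < n; i++) {
		if (rq[i].start > cur_end) {
			/* Gap with nothing in flight: close the current interval. */
			total += cur_end - cur_start;
			cur_start = rq[i].start;
			cur_end = rq[i].finish;
		} else if (rq[i].finish > cur_end) {
			/* Overlapping request: extend the current interval. */
			cur_end = rq[i].finish;
		}
	}
	return total + (cur_end - cur_start);
}
```

With t = 1 and the stamps {0,4}, {1,5}, {2,6}, {3,7}, `summed_time()` returns 16 while `busy_time()` returns 7. For a single request {0,4} at queue depth 1, both return 4, which is why a group running one sequential reader would be charged on a different scale from a group keeping several requests in flight.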
>
> How about making throttling policy be user selectable like the IO
> scheduler and putting it in the higher layer? So we could support
> all of policies (time-based, size-based and rate limiting). There
> seems not to be only one solution which satisfies all users. But I agree
> with starting with proportional bandwidth control first.
>

What are the cases where a time based policy does not work, a size based
policy works better, and the user would choose the size based policy over the
time based one?

I am not against implementing things in a higher layer as long as we can
ensure tight control on latencies, strong isolation between groups, and not
breaking CFQ's class and ioprio model within a group.

> BTW, I will start to reimplement dm-ioband into block layer.

Can you elaborate a little bit on this?

>
> > Fairness for higher level logical devices
> > =========================================
> > Do we want good fairness numbers for higher level logical devices also
> > or it is sufficient to provide fairness at leaf nodes. Providing fairness
> > at leaf nodes can help us use the resources optimally and in the process
> > we can get fairness at higher level also in many of the cases.
>
> We should also take care of block devices which provide their own
> make_request_fn() and not use a IO scheduler. We can't use the leaf
> nodes approach to such devices.
>

I am not sure how big an issue this is. It can be easily solved by having
these devices use the NOOP scheduler. What are the reasons for these devices
not to use even NOOP?

> > But do we want strict fairness numbers on higher level logical devices
> > even if it means sub-optimal usage of underlying physical devices?
> >
> > I think that for proportional bandwidth control, it should be ok to provide
> > fairness at higher level logical device but for max bandwidth control it
> > might make more sense to provide fairness at higher level.
> > Consider a case where from a striped device a customer wants to limit a
> > group to 30MB/s and in case of leaf node control, if every leaf node
> > provides 30MB/s, it might accumulate to much more than specified rate at
> > logical device.
> >
> > Latency Control and strong isolation between groups
> > ===================================================
> > Do we want a good isolation between groups and better latencies and
> > stronger isolation between groups?
> >
> > I think if problem is solved at IO scheduler level, we can achieve better
> > latency control and hence stronger isolation between groups.
> >
> > Higher level solutions should find it hard to provide same kind of latency
> > control and isolation between groups as IO scheduler based solution.
>
> Why do you think that the higher level solution is hard to provide it?
> I think that it is a matter of how to implement throttling policy.
>

So far, both in dm-ioband and in the IO throttling solution, I have seen that
the higher layer implements some kind of leaky bucket/token bucket algorithm,
which inherently allows IO from all the competing groups until they run out
of tokens, and then these groups are made to wait till fresh tokens are
issued. That means that, most of the time, the IO scheduler will see requests
from more than one group at the same time, and that will be the source of
weak isolation between groups.

Consider the following simple example. Assume there are two groups; one
contains 16 random readers and the other contains 1 random reader.

        G1      G2
        16RR    1RR

Now it might happen that the IO scheduler sees requests from all 17 random
readers at the same time. (Throttling will probably kick in later, because you
would like to give one group a nice slice of 100ms; otherwise sequential
readers will suffer a lot and the disk will become seek bound.)
So CFQ will dispatch requests (at least one) from each of the 16 random
readers first and then from the 1 random reader in group 2. This increases
the max latency for the application in group 2 and provides weak isolation.

There will also be additional issues with the CFQ preemption logic. CFQ will
have no knowledge of groups and it will do cross-group preemptions. For
example, if a metadata request comes in group 1, it will preempt any queue
being served in other groups. So somebody doing "find . *" or
"cat <small files>" in one group will keep on preempting a sequential reader
in another group. Again, this will probably lead to higher max latencies.

Note that even if CFQ does not enable idling on random readers, and expires
the queue after a single dispatch, the seek time between queues can be
significant. Similarly, if instead of 16 random readers we had 16 random
synchronous writers, we would have the seek time issue as well, and writers
can often dump bigger requests, which also adds to latency.

This latency issue can be solved if we dispatch requests from only one group
for a certain amount of time and then move to the next group (which is what
the common layer is doing).

If we go for only a single group dispatching requests, then we shall have to
implement some of the preemption semantics in the higher layer as well,
because in certain cases we want to do preemption across groups, like an RT
task group preempting a non-RT task group.

Once we go deeper into implementation, I think we will find more issues.

> > Fairness for buffered writes
> > ============================
> > Doing io control at any place below page cache has disadvantage that page
> > cache might not dispatch more writes from higher weight group hence higher
> > weight group might not see more IO done. Andrew says that we don't have
> > a solution to this problem in kernel and he would like to see it handled
> > properly.
> >
> > Only way to solve this seems to be to slow down the writers before they
> > write into page cache.
> > IO throttling patch handled it by slowing down
> > writer if it crossed max specified rate. Other suggestions have come in
> > the form of dirty_ratio per memory cgroup or a separate cgroup controller
> > altogether where some kind of per group write limit can be specified.
> >
> > So if solution is implemented at IO scheduler layer or at device mapper
> > layer, both shall have to rely on another controller to be co-mounted
> > to handle buffered writes properly.
> >
> > Fairness with-in group
> > ======================
> > One of the issues with higher level controller is that how to do fair
> > throttling so that fairness with-in group is not impacted. Especially
> > the case of making sure that we don't break the notion of ioprio of the
> > processes with-in group.
>
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here is the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
>
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.

Ok, I re-ran that test. Previously the default io_limit value was 192, and now
I have set it to 256 as you suggested. I still see the writer starving the
reader. I have removed "conv=fdatasync" from the writer so that the writer
does pure buffered writes.
With vanilla CFQ
----------------
reader: 578867200 bytes (579 MB) copied, 10.803 s, 53.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 39.4596 s, 54.4 MB/s

with dm-ioband default io_limit=192
-----------------------------------
writer: 2147483648 bytes (2.1 GB) copied, 46.2991 s, 46.4 MB/s
reader: 578867200 bytes (579 MB) copied, 52.1419 s, 11.1 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

with dm-ioband default io_limit=256
-----------------------------------
reader: 578867200 bytes (579 MB) copied, 42.6231 s, 13.6 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 49.1678 s, 43.7 MB/s

ioband2: 0 40355280 ioband 8:50 1 4 256 none weight 1024 :100
ioband1: 0 37768752 ioband 8:49 1 4 256 none weight 1024 :100

Notice that with vanilla CFQ the reader takes about 10 seconds to finish, and
with dm-ioband it takes more than 40 seconds to finish. So the writer is still
starving the reader with both io_limit 192 and 256.

On top of that, can you please give some details on how increasing the
buffered queue length reduces the impact of writers?

IO Prio issue
--------------
I ran another test where two ioband devices of weight 100 each were created
on two partitions. In the first group, 4 readers were launched. Three readers
are of class BE and prio 7; the fourth one is of class BE and prio 0. In
group 2, I launched a buffered writer.

One would expect the prio 0 reader to get more bandwidth than the prio 7
readers, and the prio 7 readers to get more or less the same bandwidth as
each other. It looks like that is not happening. Look how vanilla CFQ provides
much more bandwidth to the prio 0 reader than to the prio 7 readers, and how
putting them in the group reduces the difference between the prio 0 and
prio 7 readers.

Following are the results.
Vanilla CFQ
===========
set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.6287 s, 39.6 MB/s
578867200 bytes (579 MB) copied, 50.5431 s, 11.5 MB/s
578867200 bytes (579 MB) copied, 51.0175 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 52.1346 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 85.2212 s, 25.2 MB/s

set2
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.3198 s, 40.4 MB/s
578867200 bytes (579 MB) copied, 48.8599 s, 11.8 MB/s
578867200 bytes (579 MB) copied, 51.206 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.5233 s, 11.2 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 83.0834 s, 25.8 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 14.5222 s, 39.9 MB/s
578867200 bytes (579 MB) copied, 51.1256 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.2004 s, 11.3 MB/s
578867200 bytes (579 MB) copied, 51.9652 s, 11.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 82.7328 s, 26.0 MB/s

with dm-ioband
==============
ioband2: 0 40355280 ioband 8:50 1 4 192 none weight 768 :100
ioband1: 0 37768752 ioband 8:49 1 4 192 none weight 768 :100

set1
----
prio 0 reader: 578867200 bytes (579 MB) copied, 67.4385 s, 8.6 MB/s
578867200 bytes (579 MB) copied, 126.726 s, 4.6 MB/s
578867200 bytes (579 MB) copied, 143.203 s, 4.0 MB/s
578867200 bytes (579 MB) copied, 148.025 s, 3.9 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 156.953 s, 13.7 MB/s

set2
----
prio 0 reader: 578867200 bytes (579 MB) copied, 58.4422 s, 9.9 MB/s
578867200 bytes (579 MB) copied, 113.936 s, 5.1 MB/s
578867200 bytes (579 MB) copied, 122.763 s, 4.7 MB/s
578867200 bytes (579 MB) copied, 128.198 s, 4.5 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 141.394 s, 15.2 MB/s

set3
----
prio 0 reader: 578867200 bytes (579 MB) copied, 59.8992 s, 9.7 MB/s
578867200 bytes (579 MB) copied, 136.858 s, 4.2 MB/s
578867200 bytes (579 MB) copied, 139.91 s, 4.1 MB/s
578867200 bytes (579 MB) copied, 139.986 s, 4.1 MB/s
writer: 2147483648 bytes (2.1 GB) copied, 151.889 s, 14.1 MB/s

Note: In
vanilla CFQ, the prio 0 reader got more than 350% of the bandwidth of a
prio 7 reader. With dm-ioband this ratio changed to less than 200%.

I will run more tests, but this shows how the notion of priority within a
group changes if we implement throttling at a higher layer and don't keep it
with CFQ.

The second thing which strikes me is that I divided the disk 50% each between
readers and writers, and in that case I would expect protection for the
writers and expect the writers to finish fast. But the writers have been
slowed down a lot, and it also kills overall disk throughput. I think it
probably became seek bound.

The moment I get more time, I will run some timed fio tests and look at how
the overall disk performed and how bandwidth was distributed within a group
and between groups.

>
> > Especially io throttling patch was very bad in terms of prio with-in
> > group where throttling treated everyone equally and difference between
> > process prio disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control most likely will change the ratio in which reads
> > and writes are dispatched to disk with-in group. It used to be decided
> > by IO scheduler so far but with higher level groups doing throttling and
> > possibly buffering the bios and releasing them later, they will have to
> > come up with their own policy on in what proportion reads and writes
> > should be dispatched. In case of IO scheduler based control, all the
> > queuing takes place at IO scheduler and it still retains control of
> > in what ratio reads and writes should be dispatched.
>
> I don't think it is a concern. The current implementation of dm-ioband
> is that sync/async IO requests are handled separately and the
> backlogged IOs are released according to the order of arrival if both
> sync and async requests are backlogged.

At least the version of dm-ioband I have is not producing the desired
results. See above.

Is there a newer version? I will run some tests on that too.
But I think you will again run into the same issue, where you decide the
ratio of reads vs writes within a group, and as I change IO schedulers the
results will vary.

So at this point of time I can't think how you can solve the read vs write
ratio issue at a higher layer without changing the behavior of the underlying
IO scheduler.

>
> > Summary
> > =======
> >
> > - An io scheduler based io controller can provide better latencies,
> >   stronger isolation between groups, time based fairness and will not
> >   interfere with io schedulers policies like class, ioprio and
> >   reader vs writer issues.
> >
> >   But it can't guarantee fairness at higher logical level devices.
> >   Especially in case of max bw control, leaf node control does not sound
> >   to be the most appropriate thing.
> >
> > - IO throttling provides max bw control in terms of absolute rate. It has
> >   the advantage that it can provide control at higher level logical device
> >   and also control buffered writes without need of additional controller
> >   co-mounted.
> >
> >   But it does only max bw control and not proportion control so one might
> >   not be using resources optimally. It loses sense of task prio and class
> >   with-in group as any of the task can be throttled with-in group. Because
> >   throttling does not kick in till you hit the max bw limit, it should find
> >   it hard to provide same latencies as io scheduler based control.
> >
> > - dm-ioband also has the advantage that it can provide fairness at higher
> >   level logical devices.
> >
> >   But, fairness is provided only in terms of size of IO or number of IO.
> >   No time based fairness. It is very throughput oriented and does not
> >   throttle high speed group if other group is running slow random reader.
> >   This results in bad latencies for random reader group and weaker
> >   isolation between groups.
>
> A new policy can be added to dm-ioband. Actually, range-bw policy,
> which provides min and max bandwidth control, does time-based
> throttling.
> Moreover there is room for improvement for existing
> policies. The write-starve-read issue you pointed out will be solved
> soon.
>
> > Also it does not provide fairness if a group is not continuously
> > backlogged. So if one is running 1-2 dd/sequential readers in the group,
> > one does not get fairness until workload is increased to a point where
> > group becomes continuously backlogged. This also results in poor
> > latencies and limited fairness.
>
> This is intended to efficiently use bandwidth of underlying devices
> when IO load is low.

But this has the following undesired results.

- A slow moving group does not get reduced latencies. For example, random
  readers in a slow moving group get no isolation and will continue to see
  higher max latencies.

- A single sequential reader in one group does not get its fair share, and we
  might be pushing buffered writes in the other group thinking that we are
  getting better throughput. But the fact is that we are eating away the
  reader's share in group 1 and giving it to the writers in group 2. Also, I
  showed that we did not necessarily improve the overall throughput of the
  system by doing so (because it increases the number of seeks).

  I had sent you a mail to show that.

  http://www.linux-archive.org/device-mapper-development/368752-ioband-limited-fairness-weak-isolation-between-groups-regarding-dm-ioband-tests.html

  But you changed the test case to run 4 readers in a single group to show
  that throughput does not decrease. Please don't change test cases. With 4
  sequential readers in the group, the group is continuously backlogged and
  you don't steal bandwidth from the slow moving group. So in that mail I was
  not even discussing the scenario where you don't steal bandwidth from the
  other group.
  I specifically created one slow moving group with one reader so that we end
  up stealing bandwidth from the slow moving group, to show that we did not
  achieve higher overall throughput by stealing the bandwidth, and that at the
  same time we did not get fairness for the single reader and observed
  decreasing throughput for the single reader as the number of writers in the
  other group increased.

Thanks
Vivek

>
> > At this point of time it does not look like a single IO controller covers
> > all the scenarios/requirements. This means few things to me.
> >
> > - Drop some of the requirements and go with one implementation which meets
> >   those reduced set of requirements.
> >
> > - Have more than one IO controller implementation in kernel. One for lower
> >   level control for better latencies, stronger isolation and optimal resource
> >   usage and other one for fairness at higher level logical devices and max
> >   bandwidth control.
> >
> >   And let user decide which one to use based on his/her needs.
> >
> > - Come up with more intelligent way of doing IO control where single
> >   controller covers all the cases.
> >
> > At this point of time, I am more inclined towards option 2 of having more
> > than one implementation in kernel. :-) (Until and unless we can brainstorm
> > and come up with ideas to make option 3 happen).
> >
> > > It would be great if we discuss our plans on the mailing list, so we
> > > can get early feedback from everyone.
> >
> > This is what comes to my mind so far. Please add to the list if I have missed
> > some points. Also correct me if I am wrong about the pros/cons of the
> > approaches.
> >
> > Thoughts/ideas/opinions are welcome...
> >
> > Thanks
> > Vivek
>
> Thanks,
> Ryo Tsuruta

^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10 [not found] ` <20090929.185653.183056711.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> 2009-09-29 10:49 ` Takuya Yoshikawa 2009-09-29 14:10 ` Vivek Goyal @ 2009-09-30 3:11 ` Vivek Goyal 2 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-30 3:11 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote: > Hi Vivek and all, > > Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: > > > > We are starting from a point where there is no cgroup based IO > > > scheduling in the kernel. And it is probably not reasonable to satisfy > > > all IO scheduling related requirements in one patch set. We can start > > > with something simple, and build on top of that. So a very simple > > > patch set that enables cgroup based proportional scheduling for CFQ > > > seems like the way to go at this point. > > > > Sure, we can start with CFQ only. But a bigger question we need to answer > > is that is CFQ the right place to solve the issue? Jens, do you think > > that CFQ is the right place to solve the problem? > > > > Andrew seems to favor a high level approach so that IO schedulers are less > > complex and we can provide fairness at high level logical devices also. 
> > I'm not in favor of expansion of CFQ, because some enterprise storages > are better performed with NOOP rather than CFQ, and I think bandwidth > control is needed much more for such storage system. Is it easy to > support other IO schedulers even if a new IO scheduler is introduced? > I would like to know a bit more specific about Namuman's scheduler design. > > > I will again try to summarize my understanding so far about the pros/cons > > of each approach and then we can take the discussion forward. > > Good summary. Thanks for your work. > > > Fairness in terms of size of IO or disk time used > > ================================================= > > On a seeky media, fairness in terms of disk time can get us better results > > instead fairness interms of size of IO or number of IO. > > > > If we implement some kind of time based solution at higher layer, then > > that higher layer should know who used how much of time each group used. We > > can probably do some kind of timestamping in bio to get a sense when did it > > get into disk and when did it finish. But on a multi queue hardware there > > can be multiple requests in the disk either from same queue or from differnet > > queues and with pure timestamping based apparoch, so far I could not think > > how at high level we will get an idea who used how much of time. > > IIUC, could the overlap time be calculated from time-stamp on a multi > queue hardware? > > > So this is the first point of contention that how do we want to provide > > fairness. In terms of disk time used or in terms of size of IO/number of > > IO. > > > > Max bandwidth Controller or Proportional bandwidth controller > > ============================================================= > > What is our primary requirement here? A weight based proportional > > bandwidth controller where we can use the resources optimally and any > > kind of throttling kicks in only if there is contention for the disk. 
> > > > Or we want max bandwidth control where a group is not allowed to use the > > disk even if disk is free. > > > > Or we need both? I would think that at some point of time we will need > > both but we can start with proportional bandwidth control first. > > How about making throttling policy be user selectable like the IO > scheduler and putting it in the higher layer? So we could support > all of policies (time-based, size-based and rate limiting). There > seems not to only one solution which satisfies all users. But I agree > with starting with proportional bandwidth control first. > > BTW, I will start to reimplement dm-ioband into block layer. > > > Fairness for higher level logical devices > > ========================================= > > Do we want good fairness numbers for higher level logical devices also > > or it is sufficient to provide fairness at leaf nodes. Providing fairness > > at leaf nodes can help us use the resources optimally and in the process > > we can get fairness at higher level also in many of the cases. > > We should also take care of block devices which provide their own > make_request_fn() and not use a IO scheduler. We can't use the leaf > nodes approach to such devices. > > > But do we want strict fairness numbers on higher level logical devices > > even if it means sub-optimal usage of unerlying phsical devices? > > > > I think that for proportinal bandwidth control, it should be ok to provide > > fairness at higher level logical device but for max bandwidth control it > > might make more sense to provide fairness at higher level. Consider a > > case where from a striped device a customer wants to limit a group to > > 30MB/s and in case of leaf node control, if every leaf node provides > > 30MB/s, it might accumulate to much more than specified rate at logical > > device. 
> > > > Latency Control and strong isolation between groups > > =================================================== > > Do we want a good isolation between groups and better latencies and > > stronger isolation between groups? > > > > I think if problem is solved at IO scheduler level, we can achieve better > > latency control and hence stronger isolation between groups. > > > > Higher level solutions should find it hard to provide same kind of latency > > control and isolation between groups as IO scheduler based solution. > > Why do you think that the higher level solution is hard to provide it? > I think that it is a matter of how to implement throttling policy. > > > Fairness for buffered writes > > ============================ > > Doing io control at any place below page cache has disadvantage that page > > cache might not dispatch more writes from higher weight group hence higher > > weight group might not see more IO done. Andrew says that we don't have > > a solution to this problem in kernel and he would like to see it handled > > properly. > > > > Only way to solve this seems to be to slow down the writers before they > > write into page cache. IO throttling patch handled it by slowing down > > writer if it crossed max specified rate. Other suggestions have come in > > the form of dirty_ratio per memory cgroup or a separate cgroup controller > > al-together where some kind of per group write limit can be specified. > > > > So if solution is implemented at IO scheduler layer or at device mapper > > layer, both shall have to rely on another controller to be co-mounted > > to handle buffered writes properly. > > > > Fairness with-in group > > ====================== > > One of the issues with higher level controller is that how to do fair > > throttling so that fairness with-in group is not impacted. Especially > > the case of making sure that we don't break the notion of ioprio of the > > processes with-in group. 
>
> I ran your test script to confirm that the notion of ioprio was not
> broken by dm-ioband. Here is the results of the test.
> https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html
>
> I think that the time period during which dm-ioband holds IO requests
> for throttling would be too short to break the notion of ioprio.
>

Hi Ryo,

I am doing some more tests to see how we maintain the notion of prio
with-in group.

I have created two ioband devices, ioband1 and ioband2, of weight 100 each on
two disk partitions. On one partition/device (ioband1) a buffered writer is
doing writeout, and on the other partition I launch one prio0 reader and an
increasing number of prio4 readers using fio, let it run for 30 seconds, and
see how BW got distributed between the prio0 and prio4 processes.

Note, here readers are doing direct IO.

I did this test with vanilla CFQ and with dm-ioband + cfq.

With vanilla CFQ
----------------
    <------------------ prio4 readers ----------------->    <---- prio0 reader ---->
nr  Max-bdwidth   Min-bdwidth   Agg-bdwidth   Max-latency   Agg-bdwidth   Max-latency
1   12892KiB/s    12892KiB/s    12892KiB/s    409K usec     14705KiB/s    252K usec
2   5667KiB/s     5637KiB/s     11302KiB/s    717K usec     17555KiB/s    339K usec
4   4395KiB/s     4173KiB/s     17027KiB/s    933K usec     12437KiB/s    553K usec
8   2652KiB/s     2391KiB/s     20268KiB/s    1410K usec    9482KiB/s     685K usec
16  1653KiB/s     1413KiB/s     24035KiB/s    2418K usec    5860KiB/s     1027K usec

Note, as we increase the number of prio4 readers, the prio0 process's
aggregate bandwidth goes down (nr=2 seems to be the only exception) but it
still maintains more BW than any prio4 process.

Also note that as we increase the number of prio4 readers, their aggregate
bandwidth goes up, which is expected.

With dm-ioband
--------------
    <------------------ prio4 readers ----------------->    <---- prio0 reader ---->
nr  Max-bdwidth   Min-bdwidth   Agg-bdwidth   Max-latency   Agg-bdwidth   Max-latency
1   11242KiB/s    11242KiB/s    11242KiB/s    415K usec     3884KiB/s     244K usec
2   8110KiB/s     6236KiB/s     14345KiB/s    304K usec     320KiB/s      125K usec
4   6898KiB/s     622KiB/s      11059KiB/s    206K usec     503KiB/s      201K usec
8   345KiB/s      47KiB/s       850KiB/s      342K usec     8350KiB/s     164K usec
16  28KiB/s       28KiB/s       451KiB/s      688 msec      5092KiB/s     306K usec

Looking at the output with dm-ioband, it seems to be all over the place. Look
at the aggregate bandwidth of the prio0 reader and how wildly it is swinging.
It first goes down and then suddenly jumps way up.

Similarly, look at the aggregate bandwidth of the prio4 readers: the moment we
hit 8 readers, it suddenly tanks.

Look at the prio4 reader and prio0 reader BW with 16 prio4 processes running.
Each prio4 process gets 28KiB/s and the prio0 process gets about 5MB/s.

Can you please look into it? It looks like we have serious issues w.r.t.
fairness and bandwidth distribution with-in group.

Thanks
Vivek

> > Especially io throttling patch was very bad in terms of prio with-in
> > group where throttling treated everyone equally and difference between
> > process prio disappeared.
> >
> > Reads Vs Writes
> > ===============
> > A higher level control most likely will change the ratio in which reads
> > and writes are dispatched to disk with-in group. It used to be decided
> > by IO scheduler so far but with higher level groups doing throttling and
> > possibly buffering the bios and releasing them later, they will have to
> > come up with their own policy on in what proportion reads and writes
> > should be dispatched. In case of IO scheduler based control, all the
> > queuing takes place at IO scheduler and it still retains control of
> > in what ratio reads and writes should be dispatched.
>
> I don't think it is a concern.
The current implementation of dm-ioband > is that sync/async IO requests are handled separately and the > backlogged IOs are released according to the order of arrival if both > sync and async requests are backlogged. > > > Summary > > ======= > > > > - An io scheduler based io controller can provide better latencies, > > stronger isolation between groups, time based fairness and will not > > interfere with io schedulers policies like class, ioprio and > > reader vs writer issues. > > > > But it can gunrantee fairness at higher logical level devices. > > Especially in case of max bw control, leaf node control does not sound > > to be the most appropriate thing. > > > > - IO throttling provides max bw control in terms of absolute rate. It has > > the advantage that it can provide control at higher level logical device > > and also control buffered writes without need of additional controller > > co-mounted. > > > > But it does only max bw control and not proportion control so one might > > not be using resources optimally. It looses sense of task prio and class > > with-in group as any of the task can be throttled with-in group. Because > > throttling does not kick in till you hit the max bw limit, it should find > > it hard to provide same latencies as io scheduler based control. > > > > - dm-ioband also has the advantage that it can provide fairness at higher > > level logical devices. > > > > But, fairness is provided only in terms of size of IO or number of IO. > > No time based fairness. It is very throughput oriented and does not > > throttle high speed group if other group is running slow random reader. > > This results in bad latnecies for random reader group and weaker > > isolation between groups. > > A new policy can be added to dm-ioband. Actually, range-bw policy, > which provides min and max bandwidth control, does time-based > throttling. Moreover there is room for improvement for existing > policies. 
The write-starve-read issue you pointed out will be solved > soon. > > > Also it does not provide fairness if a group is not continuously > > backlogged. So if one is running 1-2 dd/sequential readers in the group, > > one does not get fairness until workload is increased to a point where > > group becomes continuously backlogged. This also results in poor > > latencies and limited fairness. > > This is intended to efficiently use bandwidth of underlying devices > when IO load is low. > > > At this point of time it does not look like a single IO controller all > > the scenarios/requirements. This means few things to me. > > > > - Drop some of the requirements and go with one implementation which meets > > those reduced set of requirements. > > > > - Have more than one IO controller implementation in kenrel. One for lower > > level control for better latencies, stronger isolation and optimal resource > > usage and other one for fairness at higher level logical devices and max > > bandwidth control. > > > > And let user decide which one to use based on his/her needs. > > > > - Come up with more intelligent way of doing IO control where single > > controller covers all the cases. > > > > At this point of time, I am more inclined towards option 2 of having more > > than one implementation in kernel. :-) (Until and unless we can brainstrom > > and come up with ideas to make option 3 happen). > > > > > It would be great if we discuss our plans on the mailing list, so we > > > can get early feedback from everyone. > > > > This is what comes to my mind so far. Please add to the list if I have missed > > some points. Also correct me if I am wrong about the pros/cons of the > > approaches. > > > > Thoughts/ideas/opinions are welcome... > > > > Thanks > > Vivek > > Thanks, > Ryo Tsuruta ^ permalink raw reply [flat|nested] 349+ messages in thread
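
The prio0-versus-prio4 comparison in the two tables of the message above can be checked mechanically. What follows is only a minimal sketch, not part of the original thread: the bandwidth figures are copied from the posted tables, while the names CFQ_ROWS, IOBAND_ROWS and priority_inversions are hypothetical and chosen purely for illustration. It lists the reader counts at which the prio0 reader got less bandwidth than the fastest single prio4 reader, i.e. where the ioprio ordering was inverted.

```python
# Each row: (number of prio4 readers,
#            max per-reader prio4 bandwidth in KiB/s,
#            prio0 reader bandwidth in KiB/s)
# Values are copied from the tables posted in the message above.
CFQ_ROWS = [
    (1, 12892, 14705),
    (2, 5667, 17555),
    (4, 4395, 12437),
    (8, 2652, 9482),
    (16, 1653, 5860),
]

IOBAND_ROWS = [
    (1, 11242, 3884),
    (2, 8110, 320),
    (4, 6898, 503),
    (8, 345, 8350),
    (16, 28, 5092),
]


def priority_inversions(rows):
    """Return the reader counts where prio0 got less BW than a prio4 reader."""
    return [nr for nr, prio4_max, prio0 in rows if prio0 < prio4_max]


if __name__ == "__main__":
    print("vanilla CFQ inversions:", priority_inversions(CFQ_ROWS))
    print("dm-ioband inversions:  ", priority_inversions(IOBAND_ROWS))
```

On the posted numbers, vanilla CFQ shows no inversion at any reader count, while with dm-ioband the prio0 reader falls behind a prio4 reader at nr = 1, 2 and 4, which matches the observation that the distribution within the group is erratic.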
* Re: IO scheduler based IO controller V10 @ 2009-09-30 3:11 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-30 3:11 UTC (permalink / raw) To: Ryo Tsuruta Cc: dhaval, peterz, dm-devel, dpshah, jens.axboe, agk, balbir, paolo.valente, jmarchan, guijianfeng, fernando, mikew, yoshikawa.takuya, jmoyer, nauman, mingo, m-ikeda, riel, lizf, fchecconi, s-uchida, containers, linux-kernel, akpm, righi.andrea, torvalds On Tue, Sep 29, 2009 at 06:56:53PM +0900, Ryo Tsuruta wrote: > Hi Vivek and all, > > Vivek Goyal <vgoyal@redhat.com> wrote: > > On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: > > > > We are starting from a point where there is no cgroup based IO > > > scheduling in the kernel. And it is probably not reasonable to satisfy > > > all IO scheduling related requirements in one patch set. We can start > > > with something simple, and build on top of that. So a very simple > > > patch set that enables cgroup based proportional scheduling for CFQ > > > seems like the way to go at this point. > > > > Sure, we can start with CFQ only. But a bigger question we need to answer > > is that is CFQ the right place to solve the issue? Jens, do you think > > that CFQ is the right place to solve the problem? > > > > Andrew seems to favor a high level approach so that IO schedulers are less > > complex and we can provide fairness at high level logical devices also. > > I'm not in favor of expansion of CFQ, because some enterprise storages > are better performed with NOOP rather than CFQ, and I think bandwidth > control is needed much more for such storage system. Is it easy to > support other IO schedulers even if a new IO scheduler is introduced? > I would like to know a bit more specific about Namuman's scheduler design. > > > I will again try to summarize my understanding so far about the pros/cons > > of each approach and then we can take the discussion forward. > > Good summary. Thanks for your work. 
> > > Fairness in terms of size of IO or disk time used > > ================================================= > > On a seeky media, fairness in terms of disk time can get us better results > > instead fairness interms of size of IO or number of IO. > > > > If we implement some kind of time based solution at higher layer, then > > that higher layer should know who used how much of time each group used. We > > can probably do some kind of timestamping in bio to get a sense when did it > > get into disk and when did it finish. But on a multi queue hardware there > > can be multiple requests in the disk either from same queue or from differnet > > queues and with pure timestamping based apparoch, so far I could not think > > how at high level we will get an idea who used how much of time. > > IIUC, could the overlap time be calculated from time-stamp on a multi > queue hardware? > > > So this is the first point of contention that how do we want to provide > > fairness. In terms of disk time used or in terms of size of IO/number of > > IO. > > > > Max bandwidth Controller or Proportional bandwidth controller > > ============================================================= > > What is our primary requirement here? A weight based proportional > > bandwidth controller where we can use the resources optimally and any > > kind of throttling kicks in only if there is contention for the disk. > > > > Or we want max bandwidth control where a group is not allowed to use the > > disk even if disk is free. > > > > Or we need both? I would think that at some point of time we will need > > both but we can start with proportional bandwidth control first. > > How about making throttling policy be user selectable like the IO > scheduler and putting it in the higher layer? So we could support > all of policies (time-based, size-based and rate limiting). There > seems not to only one solution which satisfies all users. 
But I agree > with starting with proportional bandwidth control first. > > BTW, I will start to reimplement dm-ioband into block layer. > > > Fairness for higher level logical devices > > ========================================= > > Do we want good fairness numbers for higher level logical devices also > > or is it sufficient to provide fairness at leaf nodes? Providing fairness > > at leaf nodes can help us use the resources optimally and in the process > > we can get fairness at higher level also in many of the cases. > > We should also take care of block devices which provide their own > make_request_fn() and do not use an IO scheduler. We can't use the leaf > nodes approach to such devices. > > > But do we want strict fairness numbers on higher level logical devices > > even if it means sub-optimal usage of underlying physical devices? > > > > I think that for proportional bandwidth control, it should be ok to provide > > fairness at leaf nodes but for max bandwidth control it > > might make more sense to provide fairness at higher level. Consider a > > case where from a striped device a customer wants to limit a group to > > 30MB/s and in case of leaf node control, if every leaf node provides > > 30MB/s, it might accumulate to much more than specified rate at logical > > device. > > > > Latency Control and strong isolation between groups > > =================================================== > > Do we want better latencies and > > stronger isolation between groups? > > > > I think if problem is solved at IO scheduler level, we can achieve better > > latency control and hence stronger isolation between groups. > > > > Higher level solutions should find it hard to provide same kind of latency > > control and isolation between groups as IO scheduler based solution. > > Why do you think that the higher level solution is hard to provide it? > I think that it is a matter of how to implement throttling policy.
> > > Fairness for buffered writes > > ============================ > > Doing io control at any place below page cache has disadvantage that page > > cache might not dispatch more writes from higher weight group hence higher > > weight group might not see more IO done. Andrew says that we don't have > > a solution to this problem in kernel and he would like to see it handled > > properly. > > > > Only way to solve this seems to be to slow down the writers before they > > write into page cache. IO throttling patch handled it by slowing down > > writer if it crossed max specified rate. Other suggestions have come in > > the form of dirty_ratio per memory cgroup or a separate cgroup controller > > altogether where some kind of per group write limit can be specified. > > > > So if solution is implemented at IO scheduler layer or at device mapper > > layer, both shall have to rely on another controller to be co-mounted > > to handle buffered writes properly. > > > > Fairness with-in group > > ====================== > > One of the issues with higher level controller is how to do fair > > throttling so that fairness with-in group is not impacted. Especially > > the case of making sure that we don't break the notion of ioprio of the > > processes with-in group. > > I ran your test script to confirm that the notion of ioprio was not > broken by dm-ioband. Here are the results of the test. > https://lists.linux-foundation.org/pipermail/containers/2009-May/017834.html > > I think that the time period during which dm-ioband holds IO requests > for throttling would be too short to break the notion of ioprio. > Hi Ryo, I am doing some more tests to see how we maintain the notion of prio with-in group. I have created two ioband devices ioband1 and ioband2 of weight 100 each on two disk partitions.
On one partition/device (ioband1) a buffered writer is doing writeout, and on the other partition I launch one prio0 reader and an increasing number of prio4 readers using fio, let it run for 30 seconds, and see how BW got distributed between the prio0 and prio4 processes. Note, here readers are doing direct IO. I did this test with vanilla CFQ and with dm-ioband + cfq.

With vanilla CFQ
----------------
     <------------------ prio4 readers ------------------>   <----- prio0 reader ----->
nr   Max-bdwidth   Min-bdwidth   Agg-bdwidth   Max-latency   Agg-bdwidth   Max-latency
1    12892KiB/s    12892KiB/s    12892KiB/s    409K usec     14705KiB/s    252K usec
2    5667KiB/s     5637KiB/s     11302KiB/s    717K usec     17555KiB/s    339K usec
4    4395KiB/s     4173KiB/s     17027KiB/s    933K usec     12437KiB/s    553K usec
8    2652KiB/s     2391KiB/s     20268KiB/s    1410K usec    9482KiB/s     685K usec
16   1653KiB/s     1413KiB/s     24035KiB/s    2418K usec    5860KiB/s     1027K usec

Note, as we increase the number of prio4 readers, the prio0 process's aggregate bandwidth goes down (nr=2 seems to be the only exception) but it still maintains more BW than any single prio4 process. Also note that as we increase the number of prio4 readers, their aggregate bandwidth goes up, which is expected.

With dm-ioband
--------------
     <------------------ prio4 readers ------------------>   <----- prio0 reader ----->
nr   Max-bdwidth   Min-bdwidth   Agg-bdwidth   Max-latency   Agg-bdwidth   Max-latency
1    11242KiB/s    11242KiB/s    11242KiB/s    415K usec     3884KiB/s     244K usec
2    8110KiB/s     6236KiB/s     14345KiB/s    304K usec     320KiB/s      125K usec
4    6898KiB/s     622KiB/s      11059KiB/s    206K usec     503KiB/s      201K usec
8    345KiB/s      47KiB/s       850KiB/s      342K usec     8350KiB/s     164K usec
16   28KiB/s       28KiB/s       451KiB/s      688 msec      5092KiB/s     306K usec

Looking at the output with dm-ioband, it seems to be all over the place. Look at the aggregate bandwidth of the prio0 reader and how wildly it is swinging. It first goes down and then suddenly jumps up way high. Similarly, look at the aggregate bandwidth of the prio4 readers: the moment we hit 8 readers, it suddenly tanks.
Look at prio4 reader and prio0 reader BW with 16 prio4 processes running. A prio4 process gets 28KiB/s and the prio0 process gets about 5MB/s. Can you please look into it? It looks like we have serious issues w.r.t. fairness and bandwidth distribution with-in group. Thanks Vivek > > Especially io throttling patch was very bad in terms of prio with-in > > group where throttling treated everyone equally and difference between > > process prio disappeared. > > > > Reads Vs Writes > > =============== > > A higher level control most likely will change the ratio in which reads > > and writes are dispatched to disk with-in group. It used to be decided > > by IO scheduler so far but with higher level groups doing throttling and > > possibly buffering the bios and releasing them later, they will have to > > come up with their own policy on in what proportion reads and writes > > should be dispatched. In case of IO scheduler based control, all the > > queuing takes place at IO scheduler and it still retains control of > > in what ratio reads and writes should be dispatched. > > I don't think it is a concern. The current implementation of dm-ioband > is that sync/async IO requests are handled separately and the > backlogged IOs are released according to the order of arrival if both > sync and async requests are backlogged. > > > Summary > > ======= > > > > - An io scheduler based io controller can provide better latencies, > > stronger isolation between groups, time based fairness and will not > > interfere with io schedulers policies like class, ioprio and > > reader vs writer issues. > > > > But it cannot guarantee fairness at higher level logical devices. > > Especially in case of max bw control, leaf node control does not sound > > to be the most appropriate thing. > > > > - IO throttling provides max bw control in terms of absolute rate.
It has > > the advantage that it can provide control at higher level logical device > > and also control buffered writes without need of additional controller > > co-mounted. > > > > But it does only max bw control and not proportion control so one might > > not be using resources optimally. It loses sense of task prio and class > > with-in group as any of the tasks can be throttled with-in group. Because > > throttling does not kick in till you hit the max bw limit, it should find > > it hard to provide same latencies as io scheduler based control. > > > > - dm-ioband also has the advantage that it can provide fairness at higher > > level logical devices. > > > > But, fairness is provided only in terms of size of IO or number of IO. > > No time based fairness. It is very throughput oriented and does not > > throttle high speed group if other group is running slow random reader. > > This results in bad latencies for random reader group and weaker > > isolation between groups. > > A new policy can be added to dm-ioband. Actually, range-bw policy, > which provides min and max bandwidth control, does time-based > throttling. Moreover there is room for improvement for existing > policies. The write-starve-read issue you pointed out will be solved > soon. > > > Also it does not provide fairness if a group is not continuously > > backlogged. So if one is running 1-2 dd/sequential readers in the group, > > one does not get fairness until workload is increased to a point where > > group becomes continuously backlogged. This also results in poor > > latencies and limited fairness. > > This is intended to efficiently use bandwidth of underlying devices > when IO load is low. > > > At this point of time it does not look like a single IO controller covers all > > the scenarios/requirements. This means a few things to me. > > > > - Drop some of the requirements and go with one implementation which meets > > that reduced set of requirements.
> > > > - Have more than one IO controller implementation in kernel. One for lower > > level control for better latencies, stronger isolation and optimal resource > > usage and other one for fairness at higher level logical devices and max > > bandwidth control. > > > > And let user decide which one to use based on his/her needs. > > > > - Come up with more intelligent way of doing IO control where single > > controller covers all the cases. > > > > At this point of time, I am more inclined towards option 2 of having more > > than one implementation in kernel. :-) (Until and unless we can brainstorm > > and come up with ideas to make option 3 happen). > > > > > It would be great if we discuss our plans on the mailing list, so we > > > can get early feedback from everyone. > > > > This is what comes to my mind so far. Please add to the list if I have missed > > some points. Also correct me if I am wrong about the pros/cons of the > > approaches. > > > > Thoughts/ideas/opinions are welcome... > > > > Thanks > > Vivek > > Thanks, > Ryo Tsuruta ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <e98e18940909281737q142c788dpd20b8bdc05dd0eff-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <e98e18940909281737q142c788dpd20b8bdc05dd0eff-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2009-09-29 3:22 ` Vivek Goyal 0 siblings, 0 replies; 349+ messages in thread From: Vivek Goyal @ 2009-09-29 3:22 UTC (permalink / raw) To: Nauman Rafique Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Mon, Sep 28, 2009 at 05:37:28PM -0700, Nauman Rafique wrote: > Hi Vivek, > Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with > Jens about IO controller during Linux Plumbers Conference '09. Jens > expressed his concerns about the size and complexity of the patches. I > believe that is a reasonable concern. We talked about things that > could be done to reduce the size of the patches. The requirement that > the "solution has to work with all IO schedulers" seems like a > secondary concern at this point; and it came out as one thing that can > help to reduce the size of the patch set. Initially doing cgroup based IO control only for CFQ should help a lot in reducing the patchset size. > Another possibility is to > use a simpler scheduling algorithm e.g. weighted round robin, instead > of BFQ scheduler. BFQ indeed has great properties, but we cannot deny > the fact that it is complex to understand, and might be cumbersome to > maintain. Core of the BFQ I have gotten rid of already. The remaining part is idle tree and data structures. 
I will see how I can simplify it further. > Also, hierarchical scheduling is something that could be > unnecessary in the first set of patches, even though cgroups are > hierarchical in nature. Sure. Though I don't think that a lot of code is there because of hierarchical nature. If we solve the issue at CFQ layer, we have to maintain at least two levels. One for queue and other for groups. So even the simplest solution becomes almost hierarchical in nature. But I will still see how to get rid of some code here too... > > We are starting from a point where there is no cgroup based IO > scheduling in the kernel. And it is probably not reasonable to satisfy > all IO scheduling related requirements in one patch set. We can start > with something simple, and build on top of that. So a very simple > patch set that enables cgroup based proportional scheduling for CFQ > seems like the way to go at this point. Sure, we can start with CFQ only. But a bigger question we need to answer is whether CFQ is the right place to solve the issue. Jens, do you think that CFQ is the right place to solve the problem? Andrew seems to favor a high level approach so that IO schedulers are less complex and we can provide fairness at high level logical devices also. I will again try to summarize my understanding so far about the pros/cons of each approach and then we can take the discussion forward. Fairness in terms of size of IO or disk time used ================================================= On seeky media, fairness in terms of disk time can get us better results than fairness in terms of size of IO or number of IO. If we implement some kind of time based solution at higher layer, then that higher layer should know how much disk time each group used. We can probably do some kind of timestamping in bio to get a sense of when it got to the disk and when it finished.
But on multi queue hardware there can be multiple requests in the disk either from same queue or from different queues and with pure timestamping based approach, so far I could not think how at high level we will get an idea who used how much of time. So this is the first point of contention: how do we want to provide fairness? In terms of disk time used or in terms of size of IO/number of IO. Max bandwidth Controller or Proportional bandwidth controller ============================================================= What is our primary requirement here? A weight based proportional bandwidth controller where we can use the resources optimally and any kind of throttling kicks in only if there is contention for the disk. Or we want max bandwidth control where a group is not allowed to use the disk beyond its limit even if the disk is free. Or we need both? I would think that at some point of time we will need both but we can start with proportional bandwidth control first. Fairness for higher level logical devices ========================================= Do we want good fairness numbers for higher level logical devices also or is it sufficient to provide fairness at leaf nodes? Providing fairness at leaf nodes can help us use the resources optimally and in the process we can get fairness at higher level also in many of the cases. But do we want strict fairness numbers on higher level logical devices even if it means sub-optimal usage of underlying physical devices? I think that for proportional bandwidth control, it should be ok to provide fairness at leaf nodes but for max bandwidth control it might make more sense to provide fairness at higher level. Consider a case where from a striped device a customer wants to limit a group to 30MB/s and in case of leaf node control, if every leaf node provides 30MB/s, it might accumulate to much more than specified rate at logical device.
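To make the striped-device concern concrete, here is a minimal sketch in Python. It is only an illustrative model, not code from any of the patchsets: the function name `aggregate_rate_mb_s`, the three-leaf stripe and the 50MB/s per-leaf demand are assumptions made for the example; only the 30MB/s limit comes from the case described above.

```python
def aggregate_rate_mb_s(per_leaf_limit_mb_s, demand_per_leaf_mb_s):
    """Rate a group sees at the logical device when every leaf device
    independently enforces the same max-bandwidth limit.

    demand_per_leaf_mb_s: how much the group would push to each leaf
    if it were not throttled (hypothetical numbers).
    """
    # Each leaf only knows its own limit, so it caps its own share.
    return sum(min(per_leaf_limit_mb_s, demand)
               for demand in demand_per_leaf_mb_s)


if __name__ == "__main__":
    # Hypothetical 3-disk stripe; the group could drive 50MB/s to each disk.
    limit = 30
    total = aggregate_rate_mb_s(limit, [50, 50, 50])
    print(f"configured limit: {limit}MB/s, seen at logical device: {total}MB/s")
```

With these assumed numbers the group ends up with three times the intended rate, which is the accumulation effect mentioned above. A controller sitting at the logical device would see the sum and could cap it directly.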
Latency Control and strong isolation between groups =================================================== Do we want better latencies and stronger isolation between groups? I think if problem is solved at IO scheduler level, we can achieve better latency control and hence stronger isolation between groups. Higher level solutions should find it hard to provide same kind of latency control and isolation between groups as IO scheduler based solution. Fairness for buffered writes ============================ Doing io control at any place below page cache has disadvantage that page cache might not dispatch more writes from higher weight group hence higher weight group might not see more IO done. Andrew says that we don't have a solution to this problem in kernel and he would like to see it handled properly. Only way to solve this seems to be to slow down the writers before they write into page cache. IO throttling patch handled it by slowing down writer if it crossed max specified rate. Other suggestions have come in the form of dirty_ratio per memory cgroup or a separate cgroup controller altogether where some kind of per group write limit can be specified. So if solution is implemented at IO scheduler layer or at device mapper layer, both shall have to rely on another controller to be co-mounted to handle buffered writes properly. Fairness with-in group ====================== One of the issues with higher level controller is how to do fair throttling so that fairness with-in group is not impacted. Especially the case of making sure that we don't break the notion of ioprio of the processes with-in group. Especially io throttling patch was very bad in terms of prio with-in group where throttling treated everyone equally and difference between process prio disappeared. Reads Vs Writes =============== A higher level control most likely will change the ratio in which reads and writes are dispatched to disk with-in group.
It used to be decided by IO scheduler so far but with higher level groups doing throttling and possibly buffering the bios and releasing them later, they will have to come up with their own policy on in what proportion reads and writes should be dispatched. In case of IO scheduler based control, all the queuing takes place at IO scheduler and it still retains control of in what ratio reads and writes should be dispatched. Summary ======= - An io scheduler based io controller can provide better latencies, stronger isolation between groups, time based fairness and will not interfere with io schedulers policies like class, ioprio and reader vs writer issues. But it cannot guarantee fairness at higher level logical devices. Especially in case of max bw control, leaf node control does not sound to be the most appropriate thing. - IO throttling provides max bw control in terms of absolute rate. It has the advantage that it can provide control at higher level logical device and also control buffered writes without need of additional controller co-mounted. But it does only max bw control and not proportion control so one might not be using resources optimally. It loses sense of task prio and class with-in group as any of the tasks can be throttled with-in group. Because throttling does not kick in till you hit the max bw limit, it should find it hard to provide same latencies as io scheduler based control. - dm-ioband also has the advantage that it can provide fairness at higher level logical devices. But, fairness is provided only in terms of size of IO or number of IO. No time based fairness. It is very throughput oriented and does not throttle high speed group if other group is running slow random reader. This results in bad latencies for random reader group and weaker isolation between groups. Also it does not provide fairness if a group is not continuously backlogged.
So if one is running 1-2 dd/sequential readers in the group, one does not get fairness until workload is increased to a point where group becomes continuously backlogged. This also results in poor latencies and limited fairness. At this point of time it does not look like a single IO controller covers all the scenarios/requirements. This means a few things to me. - Drop some of the requirements and go with one implementation which meets that reduced set of requirements. - Have more than one IO controller implementation in kernel. One for lower level control for better latencies, stronger isolation and optimal resource usage and other one for fairness at higher level logical devices and max bandwidth control. And let user decide which one to use based on his/her needs. - Come up with more intelligent way of doing IO control where single controller covers all the cases. At this point of time, I am more inclined towards option 2 of having more than one implementation in kernel. :-) (Until and unless we can brainstorm and come up with ideas to make option 3 happen). > > It would be great if we discuss our plans on the mailing list, so we > can get early feedback from everyone. This is what comes to my mind so far. Please add to the list if I have missed some points. Also correct me if I am wrong about the pros/cons of the approaches. Thoughts/ideas/opinions are welcome... Thanks Vivek ^ permalink raw reply [flat|nested] 349+ messages in thread
[parent not found: <1253820332-10246-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]
* Re: IO scheduler based IO controller V10 [not found] ` <1253820332-10246-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> @ 2009-09-24 21:33 ` Andrew Morton 2009-09-25 2:20 ` Ulrich Lukas 2009-09-29 0:37 ` Nauman Rafique 2 siblings, 0 replies; 349+ messages in thread From: Andrew Morton @ 2009-09-24 21:33 UTC (permalink / raw) To: Vivek Goyal Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b On Thu, 24 Sep 2009 15:25:04 -0400 Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote: > > Hi All, > > Here is the V10 of the IO controller patches generated on top of 2.6.31. > Thanks for the writeup. It really helps and is most worthwhile for a project of this importance, size and complexity. > > What problem are we trying to solve > =================================== > Provide group IO scheduling feature in Linux along the lines of other resource > controllers like cpu. > > IOW, provide facility so that a user can group applications using cgroups and > control the amount of disk time/bandwidth received by a group based on its > weight. > > How to solve the problem > ========================= > > Different people have solved the issue differetnly. So far looks it looks > like we seem to have following two core requirements when it comes to > fairness at group level. > > - Control bandwidth seen by groups. > - Control on latencies when a request gets backlogged in group. 
> > At least there are now three patchsets available (including this one). > > IO throttling > ------------- > This is a bandwidth controller which keeps track of IO rate of a group and > throttles the process in the group if it exceeds the user specified limit. > > dm-ioband > --------- > This is a proportional bandwidth controller implemented as device mapper > driver and provides fair access in terms of amount of IO done (not in terms > of disk time as CFQ does). > > So one will setup one or more dm-ioband devices on top of physical/logical > block device, configure the ioband device and pass information like grouping > etc. Now this device will keep track of bios flowing through it and control > the flow of bios based on group policies. > > IO scheduler based IO controller > -------------------------------- > Here we have viewed the problem of IO controller as hierarchical group > scheduling (along the lines of CFS group scheduling) issue. Currently one can > view linux IO schedulers as flat where there is one root group and all the IO > belongs to that group. > > This patchset basically modifies IO schedulers to also support hierarchical > group scheduling. CFQ already provides fairness among different processes. I > have extended it to support group IO scheduling. Also took some of the code out > of CFQ and put in a common layer so that same group scheduling code can be > used by noop, deadline and AS to support group scheduling. > > Pros/Cons > ========= > There are pros and cons to each of the approaches. Following are some of the > thoughts. > > Max bandwidth vs proportional bandwidth > --------------------------------------- > IO throttling is a max bandwidth controller and not a proportional one. > Additionally it provides fairness in terms of amount of IO done (and not in > terms of disk time as CFQ does). > > Personally, I think that proportional weight controller is useful to more > people than just max bandwidth controller.
In addition, IO scheduler based > controller can also be enhanced to do max bandwidth control. So it can > satisfy wider set of requirements. > > Fairness in terms of disk time vs size of IO > --------------------------------------------- > A higher level controller will most likely be limited to providing fairness > in terms of size/number of IO done and will find it hard to provide fairness > in terms of disk time used (as CFQ provides between various prio levels). This > is because only IO scheduler knows how much disk time a queue has used and > information about queues and disk time used is not exported to higher > layers. > > So a seeky application will still run away with lot of disk time and bring > down the overall throughput of the disk. But that's only true if the thing is poorly implemented. A high-level controller will need some view of the busyness of the underlying device(s). That could be "proportion of idle time", or "average length of queue" or "average request latency" or some mix of these or something else altogether. But these things are simple to calculate, and are simple to feed back to the higher-level controller and probably don't require any changes to the IO scheduler at all, which is a great advantage. And I must say that high-level throttling based upon feedback from lower layers seems like a much better model to me than hacking away in the IO scheduler layer. Both from an implementation point of view and from a "we can get it to work on things other than block devices" point of view. > Currently dm-ioband provides fairness in terms of number/size of IO. > > Latencies and isolation between groups > -------------------------------------- > A higher level controller is generally implementing a bandwidth throttling > solution where if a group exceeds either the max bandwidth or the proportional > share then that group is throttled.
> > This kind of approach will probably not help in controlling latencies as it > will depend on underlying IO scheduler. Consider following scenario. > > Assume there are two groups. One group is running multiple sequential readers > and other group has a random reader. Sequential readers will get a nice 100ms > slice Do you refer to each reader within group1, or to all readers? It would be daft if each reader in group1 were to get 100ms. > each and then a random reader from group2 will get to dispatch the > request. So latency of this random reader will depend on how many sequential > readers are running in other group and that is a weak isolation between groups. And yet that is what you appear to mean. But surely nobody would do that - the 100ms would be assigned to and distributed amongst all readers in group1? > When we control things at IO scheduler level, we assign one time slice to one > group and then pick next entity to run. So effectively after one time slice > (max 180ms, if prio 0 sequential reader is running), random reader in other > group will get to run. Hence we achieve better isolation between groups as > response time of process in a different group is generally not dependent on > number of processes running in competing group. I don't understand why you're comparing this implementation with such an obviously dumb competing design! > So a higher level solution is most likely limited to only shaping bandwidth > without any control on latencies. > > Stacking group scheduler on top of CFQ can lead to issues > --------------------------------------------------------- > IO throttling and dm-ioband both are second level controllers. That is these > controllers are implemented in higher layers than io schedulers. So they > control the IO at higher layer based on group policies and later IO > schedulers take care of dispatching these bios to disk.
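The two latency cases being argued about in the scenario above can be written down as a back-of-the-envelope model. This is a sketch under stated assumptions, not a description of how any of the controllers actually behaves: the function names and the reader counts are made up for illustration, and only the 100ms per-reader slice and the 180ms maximum group slice come from the quoted text. The first function models the per-reader-slice behaviour that the reply calls unrealistic for a sensible higher-level design, so it should be read as the pessimistic bound.

```python
def wait_per_reader_slices_ms(n_seq_readers, slice_ms=100):
    """Worst-case wait of the random reader if every sequential reader
    in the other group gets its own full slice first (flat scheduling)."""
    return n_seq_readers * slice_ms


def wait_per_group_slice_ms(n_competing_groups=1, group_slice_ms=180):
    """Worst-case wait of the random reader if each competing *group*
    gets one slice, regardless of how many readers it contains."""
    return n_competing_groups * group_slice_ms


if __name__ == "__main__":
    for readers in (1, 2, 4, 8, 16):
        flat = wait_per_reader_slices_ms(readers)
        grouped = wait_per_group_slice_ms()
        print(f"{readers:2d} sequential readers: "
              f"per-reader slices {flat}ms, per-group slice {grouped}ms")
```

In this model the per-group wait stays constant as readers are added, while the per-reader wait grows linearly. Whether a higher-level controller would really hand out one slice per reader is exactly the point under dispute.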
> > Implementing a second level controller has the advantage of being able to > provide bandwidth control even on logical block devices in the IO stack > which don't have any IO schedulers attached to them. But they can also > interfere with IO scheduling policy of underlying IO scheduler and change > the effective behavior. Following are some of the issues which I think > should be visible in second level controller in one form or other. > > Prio with-in group > ------------------ > A second level controller can potentially interfere with behavior of > different prio processes with-in a group. Bios are buffered at higher layer > in single queue and release of bios is FIFO and not proportionate to the > ioprio of the process. This can result in a particular prio level not > getting fair share. That's an administrator error, isn't it? Should have put the different-priority processes into different groups. > Buffering at higher layer can delay read requests for more than slice idle > period of CFQ (default 8 ms). That means, it is possible that we are waiting > for a request from the queue but it is buffered at higher layer and then idle > timer will fire. It means that queue will lose its share and at the same time > overall throughput will be impacted as we lost those 8 ms. That sounds like a bug. > Read Vs Write > ------------- > Writes can overwhelm readers hence second level controller FIFO release > will run into issue here. If there is a single queue maintained then reads > will suffer large latencies. If there are separate queues for reads and writes > then it will be hard to decide in what ratio to dispatch reads and writes as > it is IO scheduler's decision to decide when and how much read/write to > dispatch. This is another place where higher level controller will not be in > sync with lower level io scheduler and can change the effective policies of > underlying io scheduler.
The IO schedulers already take care of read-vs-write and already take care of preventing large writes-starve-reads latencies (or at least, they're supposed to). > CFQ IO context Issues > --------------------- > Buffering at higher layer means submission of bios later with the help of > a worker thread. Why? If it's a read, we just block the userspace process. If it's a delayed write, the IO submission already happens in a kernel thread. If it's a synchronous write, we have to block the userspace caller anyway. Async reads might be an issue, dunno. > This changes the io context information at CFQ layer which > assigns the request to submitting thread. Change of io context info again > leads to issues of idle timer expiry and issue of a process not getting fair > share and reduced throughput. But we already have that problem with delayed writeback, which is a huge thing - often it's the majority of IO. > Throughput with noop, deadline and AS > --------------------------------------------- > I think a higher level controller will result in reduced overall throughput > (as compared to io scheduler based io controller) and more seeks with noop, > deadline and AS. > > The reason being, that it is likely that IO with-in a group will be related > and will be relatively close as compared to IO across the groups. For example, > thread pool of kvm-qemu doing IO for virtual machine. In case of higher level > control, IO from various groups will go into a single queue at lower level > controller and it might happen that IO is now interleaved (G1, G2, G1, G3, > G4....) causing more seeks and reduced throughput. (Agreed that merging will > help up to some extent but still....). > > Instead, in case of lower level controller, IO scheduler maintains one queue > per group hence there is no interleaving of IO between groups. And if IO is > related with-in group, then we should get reduced number/amount of seek and > higher throughput.
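The interleaving argument in the quoted paragraph can be illustrated with a toy seek-distance calculation. This is a sketch only: the sector numbers, the two-group setup and the helper name `total_seek_distance` are hypothetical, and it ignores request merging and any sorting an elevator would do, both of which would soften the effect.

```python
def total_seek_distance(sectors, start=0):
    """Sum of absolute head movements when requests are served in order."""
    position = start
    total = 0
    for sector in sectors:
        total += abs(sector - position)
        position = sector
    return total


# Hypothetical request streams: each group's IO is close together,
# but the two groups work on distant parts of the disk.
g1 = [1000, 1008, 1016, 1024]
g2 = [900000, 900008, 900016, 900024]

# Single shared queue fed alternately by both groups (G1, G2, G1, G2, ...).
interleaved = [sector for pair in zip(g1, g2) for sector in pair]
# One queue per group, served one group after the other.
per_group = g1 + g2

if __name__ == "__main__":
    print("interleaved dispatch:", total_seek_distance(interleaved))
    print("per-group dispatch:  ", total_seek_distance(per_group))
```

Under these assumptions the interleaved order moves the head several times further than the per-group order. The model says nothing about the case where a throttled group is simply not submitting IO, in which the remaining stream keeps its locality.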
> > Latency can be a concern but that can be controlled by reducing the time > slice length of the queue. Well maybe, maybe not. If a group is throttled, it isn't submitting new IO. The unthrottled group is doing the IO submitting and that IO will have decent locality. > Fairness at logical device level vs at physical device level > ------------------------------------------------------------ > > IO scheduler based controller has the limitation that it works only with the > bottom most devices in the IO stack where IO scheduler is attached. > > For example, assume a user has created a logical device lv0 using three > underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2 > in two groups doing IO on lv0. Also assume that weights of groups are in the > ratio of 2:1 so T1 should get double the BW of T2 on lv0 device. > > T1 T2 > \ / > lv0 > / | \ > sda sdb sdc > > > Now resource control will take place only on devices sda, sdb and sdc and > not at lv0 level. So if IO from two tasks is relatively uniformly > distributed across the disks then T1 and T2 will see the throughput ratio > in proportion to weight specified. But if IO from T1 and T2 is going to > different disks and there is no contention then at higher level they both > will see same BW. > > Here a second level controller can produce better fairness numbers at > logical device but most likely at redued overall throughput of the system, > because it will try to control IO even if there is no contention at phsical > possibly leaving diksks unused in the system. > > Hence, question comes that how important it is to control bandwidth at > higher level logical devices also. The actual contention for resources is > at the leaf block device so it probably makes sense to do any kind of > control there and not at the intermediate devices. Secondly probably it > also means better use of available resources. hm. What will be the effects of this limitation in real-world use? 
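One way to put numbers on the lv0 example quoted above is a toy model, sketched here under loud assumptions that are not taken from the patches: every leaf disk has the same bandwidth, each disk splits its bandwidth among the groups contending on it strictly by weight, and a group is always backlogged on every disk it touches. The helper names (`per_disk_share`, `throughput`) are made up for illustration.

```python
# Toy model of the lv0 example: each leaf disk splits its bandwidth among the
# groups that are actively contending on it, in proportion to their weights.
# Assumption (not from the patches): a group is always backlogged on every
# disk it touches, and every disk has the same bandwidth.

def per_disk_share(weights, active):
    """Fraction of one disk each active group receives, by weight."""
    total = sum(weights[g] for g in active)
    return {g: weights[g] / total for g in active}


def throughput(weights, placement, disk_bw=1.0):
    """placement maps a disk name to the set of groups doing IO on it."""
    totals = {g: 0.0 for g in weights}
    for active in placement.values():
        if not active:
            continue  # idle disk, nothing to distribute
        for g, share in per_disk_share(weights, active).items():
            totals[g] += share * disk_bw
    return totals


weights = {"T1": 2, "T2": 1}

# IO from both tasks spread uniformly over sda, sdb and sdc.
uniform = throughput(weights, {"sda": {"T1", "T2"},
                               "sdb": {"T1", "T2"},
                               "sdc": {"T1", "T2"}})

# IO from the two tasks landing on different disks, no contention.
disjoint = throughput(weights, {"sda": {"T1"}, "sdb": {"T2"}, "sdc": set()})

print(uniform)   # T1 ends up with about twice the throughput of T2
print(disjoint)  # T1 and T2 end up with the same throughput
```

In this sketch the 2:1 ratio only shows up at lv0 when the two groups actually meet on the same leaf disks, and the disjoint case also leaves sdc idle. Whether real workloads look more like the first or the second placement is exactly the open question.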
> Limited Fairness > ---------------- > Currently CFQ idles on a sequential reader queue to make sure it gets its > fair share. A second level controller will find it tricky to anticipate. > Either it will not have any anticipation logic and in that case it will not > provide fairness to single readers in a group (as dm-ioband does) or if it > starts anticipating then we should run into these strange situations where > second level controller is anticipating on one queue/group and underlying > IO scheduler might be anticipating on something else. It depends on the size of the inter-group timeslices. If the amount of time for which a group is unthrottled is "large" comapred to the typical anticipation times, this issue fades away. And those timeslices _should_ be large. Because as you mentioned above, different groups are probably working different parts of the disk. > Need of device mapper tools > --------------------------- > A device mapper based solution will require creation of a ioband device > on each physical/logical device one wants to control. So it requires usage > of device mapper tools even for the people who are not using device mapper. > At the same time creation of ioband device on each partition in the system to > control the IO can be cumbersome and overwhelming if system has got lots of > disks and partitions with-in. > > > IMHO, IO scheduler based IO controller is a reasonable approach to solve the > problem of group bandwidth control, and can do hierarchical IO scheduling > more tightly and efficiently. > > But I am all ears to alternative approaches and suggestions how doing things > can be done better and will be glad to implement it. > > TODO > ==== > - code cleanups, testing, bug fixing, optimizations, benchmarking etc... > - More testing to make sure there are no regressions in CFQ. > > Testing > ======= > > Environment > ========== > A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem. That's a bit of a toy. 
Do we have testing results for more enterprisey hardware? Big storage arrays? SSD? Infiniband? iscsi? nfs? (lol, gotcha) > I am mostly > running fio jobs which have been limited to 30 seconds run and then monitored > the throughput and latency. > > Test1: Random Reader Vs Random Writers > ====================================== > Launched a random reader and then increasing number of random writers to see > the effect on random reader BW and max lantecies. > > [fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs= <1 to 32> ] > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1] > > [Vanilla CFQ, No groups] > <--------------random writers--------------------> <------random reader--> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > 1 5737KiB/s 5737KiB/s 5737KiB/s 164K usec 503KiB/s 159K usec > 2 2055KiB/s 1984KiB/s 4039KiB/s 1459K usec 150KiB/s 170K usec > 4 1238KiB/s 932KiB/s 4419KiB/s 4332K usec 153KiB/s 225K usec > 8 1059KiB/s 929KiB/s 7901KiB/s 1260K usec 118KiB/s 377K usec > 16 604KiB/s 483KiB/s 8519KiB/s 3081K usec 47KiB/s 756K usec > 32 367KiB/s 222KiB/s 9643KiB/s 5940K usec 22KiB/s 923K usec > > Created two cgroups group1 and group2 of weights 500 each. Launched increasing > number of random writers in group1 and one random reader in group2 using fio. 
>
> [IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
> <--------------random writers(group1)-------------> <-random reader(group2)->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   18115KiB/s   18115KiB/s   18115KiB/s   604K usec    345KiB/s     176K usec
> 2   3752KiB/s    3676KiB/s    7427KiB/s    4367K usec   402KiB/s     187K usec
> 4   1951KiB/s    1863KiB/s    7642KiB/s    1989K usec   384KiB/s     181K usec
> 8   755KiB/s     629KiB/s     5683KiB/s    2133K usec   366KiB/s     319K usec
> 16  418KiB/s     369KiB/s     6276KiB/s    1323K usec   352KiB/s     287K usec
> 32  236KiB/s     191KiB/s     6518KiB/s    1910K usec   337KiB/s     273K usec

That's a good result.

> Also ran the same test with IO controller CFQ in flat mode to see if there
> are any major deviations from Vanilla CFQ. Does not look like any.
>
> [IO controller CFQ; No groups ]
> <--------------random writers--------------------> <------random reader-->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   5696KiB/s    5696KiB/s    5696KiB/s    259K usec    500KiB/s     194K usec
> 2   2483KiB/s    2197KiB/s    4680KiB/s    887K usec    150KiB/s     159K usec
> 4   1471KiB/s    1433KiB/s    5817KiB/s    962K usec    126KiB/s     189K usec
> 8   691KiB/s     580KiB/s     5159KiB/s    2752K usec   197KiB/s     246K usec
> 16  781KiB/s     698KiB/s     11892KiB/s   943K usec    61KiB/s      529K usec
> 32  415KiB/s     324KiB/s     12461KiB/s   4614K usec   17KiB/s      737K usec
>
> Notes:
> - With vanilla CFQ, random writers can overwhelm a random reader. Bring down
>   its throughput and bump up latencies significantly.

Isn't that a CFQ shortcoming which we should address separately? If so, the comparisons aren't presently valid because we're comparing with a CFQ which has known, should-be-fixed problems.

> - With IO controller, one can provide isolation to the random reader group and
>   maintain consistent view of bandwidth and latencies.
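As a rough cross-check of the isolation claim, the random reader's aggregate bandwidth from the two Test1 tables (vanilla CFQ without groups, and IO controller CFQ with two groups of weight 500) can be expressed as a percentage of what it had with a single writer. The figures are copied from the tables in this thread; the helper name `retained_percent` is only illustrative.

```python
# Aggregate bandwidth of the random reader in KiB/s, copied from the Test1
# tables, keyed by the number of competing random writers.
vanilla_cfq = {1: 503, 2: 150, 4: 153, 8: 118, 16: 47, 32: 22}
io_controller = {1: 345, 2: 402, 4: 384, 8: 366, 16: 352, 32: 337}


def retained_percent(bw_by_writers):
    """Reader bandwidth at each writer count, as a percentage of the 1-writer case."""
    baseline = bw_by_writers[1]
    return {n: 100.0 * bw / baseline for n, bw in bw_by_writers.items()}


for name, table in (("vanilla CFQ", vanilla_cfq), ("IO controller", io_controller)):
    pct = retained_percent(table)
    print(name, {n: round(p, 1) for n, p in pct.items()})
```

With these figures the reader keeps under 5% of its single-writer bandwidth at 32 writers on vanilla CFQ, and over 95% when it sits in its own group. This says nothing about whether vanilla CFQ itself should be fixed, only how far apart the two columns are.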
> > Test2: Random Reader Vs Sequential Reader > ======================================== > Launched a random reader and then increasing number of sequential readers to > see the effect on BW and latencies of random reader. > > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs= <1 to 16> ] > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1] > > [ Vanilla CFQ, No groups ] > <---------------seq readers----------------------> <------random reader--> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > 1 23318KiB/s 23318KiB/s 23318KiB/s 55940 usec 36KiB/s 247K usec > 2 14732KiB/s 11406KiB/s 26126KiB/s 142K usec 20KiB/s 446K usec > 4 9417KiB/s 5169KiB/s 27338KiB/s 404K usec 10KiB/s 993K usec > 8 3360KiB/s 3041KiB/s 25850KiB/s 954K usec 60KiB/s 956K usec > 16 1888KiB/s 1457KiB/s 26763KiB/s 1871K usec 28KiB/s 1868K usec > > Created two cgroups group1 and group2 of weights 500 each. Launched increasing > number of sequential readers in group1 and one random reader in group2 using > fio. > > [IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500] > <---------------group1---------------------------> <------group2---------> > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency > 1 13733KiB/s 13733KiB/s 13733KiB/s 247K usec 330KiB/s 154K usec > 2 8553KiB/s 4963KiB/s 13514KiB/s 472K usec 322KiB/s 174K usec > 4 5045KiB/s 1367KiB/s 13134KiB/s 947K usec 318KiB/s 178K usec > 8 1774KiB/s 1420KiB/s 13035KiB/s 1871K usec 323KiB/s 233K usec > 16 959KiB/s 518KiB/s 12691KiB/s 3809K usec 324KiB/s 208K usec > > Also ran the same test with IO controller CFQ in flat mode to see if there > are any major deviations from Vanilla CFQ. Does not look like any. 
>
> [IO controller CFQ; No groups ]
> <---------------seq readers----------------------> <------random reader-->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   23028KiB/s   23028KiB/s   23028KiB/s   47460 usec   36KiB/s      253K usec
> 2   14452KiB/s   11176KiB/s   25628KiB/s   145K usec    20KiB/s      447K usec
> 4   8815KiB/s    5720KiB/s    27121KiB/s   396K usec    10KiB/s      968K usec
> 8   3335KiB/s    2827KiB/s    24866KiB/s   960K usec    62KiB/s      955K usec
> 16  1784KiB/s    1311KiB/s    26537KiB/s   1883K usec   26KiB/s      1866K usec
>
> Notes:
> - The BW and latencies of random reader in group 2 seems to be stable and
>   bounded and does not get impacted much as number of sequential readers
>   increase in group1. Hence providing good isolation.
>
> - Throughput of sequential readers comes down and latencies go up as half
>   of disk bandwidth (in terms of time) has been reserved for random reader
>   group.
>
> Test3: Sequential Reader Vs Sequential Reader
> ============================================
> Created two cgroups group1 and group2 of weights 500 and 1000 respectively.
> Launched increasing number of sequential readers in group1 and one sequential
> reader in group2 using fio and monitored how bandwidth is being distributed
> between two groups.
>
> First 5 columns give stats about job in group1 and last two columns give
> stats about job in group2.
>
> <---------------group1---------------------------> <------group2--------->
> nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency  Agg-bdwidth  Max-latency
> 1   8970KiB/s    8970KiB/s    8970KiB/s    230K usec    20681KiB/s   124K usec
> 2   6783KiB/s    3202KiB/s    9984KiB/s    546K usec    19682KiB/s   139K usec
> 4   4641KiB/s    1029KiB/s    9280KiB/s    1185K usec   19235KiB/s   172K usec
> 8   1435KiB/s    1079KiB/s    9926KiB/s    2461K usec   19501KiB/s   153K usec
> 16  764KiB/s     398KiB/s     9395KiB/s    4986K usec   19367KiB/s   172K usec
>
> Note: group2 is getting double the bandwidth of group1 even in the face
> of increasing number of readers in group1.
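The "double the bandwidth" note for Test3 can be checked directly against the table: with weights 500 and 1000 the expected group2/group1 ratio is 2. The sketch below only does that division on the quoted aggregate figures. The controller is described as fair in disk time, so treating bandwidth as a proxy is an assumption that is plausible here only because both groups run sequential readers.

```python
# (readers in group1, group1 aggregate KiB/s, group2 aggregate KiB/s),
# copied from the Test3 table.
test3 = [
    (1, 8970, 20681),
    (2, 9984, 19682),
    (4, 9280, 19235),
    (8, 9926, 19501),
    (16, 9395, 19367),
]

EXPECTED = 1000 / 500  # group2 weight / group1 weight


def ratios(rows):
    """Observed group2/group1 bandwidth ratio per row, keyed by reader count."""
    return {nr: g2 / g1 for nr, g1, g2 in rows}


for nr, r in ratios(test3).items():
    print(f"nr={nr:2d} group2/group1 = {r:.2f} (weights suggest {EXPECTED:.1f})")
```

All five rows land between roughly 1.96 and 2.31, so the observed split stays close to the configured 2:1 as readers are added to group1.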
>
> Test4 (Isolation between two KVM virtual machines)
> ==================================================
> Created two KVM virtual machines. Partitioned a disk on host in two partitions
> and gave one partition to each virtual machine. Put the two virtual machines
> in two different cgroups of weight 1000 and 500 respectively. Virtual machines
> created ext3 file system on the partitions exported from host and did buffered
> writes. Host sees writes as synchronous and virtual machine with higher weight
> gets double the disk time of virtual machine of lower weight. Used deadline
> scheduler in this test case.
>
> Some more details about configuration are in documentation patch.
>
> Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
> ===================================================================
> Fairness for async writes is tricky and biggest reason is that async writes
> are cached in higher layers (page cache) as well as possibly in file system
> layer also (btrfs, xfs etc), and are dispatched to lower layers not necessarily
> in proportional manner.
>
> For example, consider two dd threads reading /dev/zero as input file and doing
> writes of huge files. Very soon we will cross vm_dirty_ratio and dd thread will
> be forced to write out some pages to disk before more pages can be dirtied. But
> not necessarily dirty pages of same thread are picked. It can very well pick
> the inode of lesser priority dd thread and do some writeout. So effectively
> higher weight dd is doing writeouts of lower weight dd pages and we don't see
> service differentiation.
>
> IOW, the core problem with buffered write fairness is that higher weight thread
> does not throw enough IO traffic at IO controller to keep the queue
> continuously backlogged.
> In my testing, there are many 0.2 to 0.8 second
> intervals where higher weight queue is empty and in that duration lower weight
> queue gets lots of work done, giving the impression that there was no service
> differentiation.
>
> In summary, from IO controller point of view async writes support is there.
> Because page cache has not been designed in such a manner that higher
> prio/weight writer can do more write out as compared to lower prio/weight
> writer, getting service differentiation is hard and it is visible in some
> cases and not visible in some cases.

Here's where it all falls to pieces.

For async writeback we just don't care about IO priorities. Because from the point of view of the userspace task, the write was async! It occurred at memory bandwidth speed.

It's only when the kernel's dirty memory thresholds start to get exceeded that we start to care about prioritisation. And at that time, all dirty memory (within a memcg?) is equal - a high-ioprio dirty page consumes just as much memory as a low-ioprio dirty page.

So when balance_dirty_pages() hits, what do we want to do?

I suppose that all we can do is to block low-ioprio processes more aggressively at the VFS layer, to reduce the rate at which they're dirtying memory so as to give high-ioprio processes more of the disk bandwidth.

But you've gone and implemented all of this stuff at the io-controller level and not at the VFS level so you're, umm, screwed.

Importantly screwed! It's a very common workload pattern, and one which causes tremendous amounts of IO to be generated very quickly, traditionally causing bad latency effects all over the place. And we have no answer to this.

> Vanilla CFQ Vs IO Controller CFQ
> ================================
> We have not fundamentally changed CFQ, instead enhanced it to also support
> hierarchical io scheduling. In the process invariably there are small changes
> here and there as new scenarios come up.
Running some tests here and comparing > both the CFQ's to see if there is any major deviation in behavior. > > Test1: Sequential Readers > ========================= > [fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ] > > IO scheduler: Vanilla CFQ > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > 1 35499KiB/s 35499KiB/s 35499KiB/s 19195 usec > 2 17089KiB/s 13600KiB/s 30690KiB/s 118K usec > 4 9165KiB/s 5421KiB/s 29411KiB/s 380K usec > 8 3815KiB/s 3423KiB/s 29312KiB/s 830K usec > 16 1911KiB/s 1554KiB/s 28921KiB/s 1756K usec > > IO scheduler: IO controller CFQ > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > 1 34494KiB/s 34494KiB/s 34494KiB/s 14482 usec > 2 16983KiB/s 13632KiB/s 30616KiB/s 123K usec > 4 9237KiB/s 5809KiB/s 29631KiB/s 372K usec > 8 3901KiB/s 3505KiB/s 29162KiB/s 822K usec > 16 1895KiB/s 1653KiB/s 28945KiB/s 1778K usec > > Test2: Sequential Writers > ========================= > [fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16> ] > > IO scheduler: Vanilla CFQ > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > 1 22669KiB/s 22669KiB/s 22669KiB/s 401K usec > 2 14760KiB/s 7419KiB/s 22179KiB/s 571K usec > 4 5862KiB/s 5746KiB/s 23174KiB/s 444K usec > 8 3377KiB/s 2199KiB/s 22427KiB/s 1057K usec > 16 2229KiB/s 556KiB/s 20601KiB/s 5099K usec > > IO scheduler: IO Controller CFQ > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > 1 22911KiB/s 22911KiB/s 22911KiB/s 37319 usec > 2 11752KiB/s 11632KiB/s 23383KiB/s 245K usec > 4 6663KiB/s 5409KiB/s 23207KiB/s 384K usec > 8 3161KiB/s 2460KiB/s 22566KiB/s 935K usec > 16 1888KiB/s 795KiB/s 21349KiB/s 3009K usec > > Test3: Random Readers > ========================= > [fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16] > > IO scheduler: Vanilla CFQ > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > 1 484KiB/s 484KiB/s 484KiB/s 22596 usec > 2 229KiB/s 196KiB/s 425KiB/s 51111 usec > 4 119KiB/s 73KiB/s 405KiB/s 
2344 msec > 8 93KiB/s 23KiB/s 399KiB/s 2246 msec > 16 38KiB/s 8KiB/s 328KiB/s 3965 msec > > IO scheduler: IO Controller CFQ > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > 1 483KiB/s 483KiB/s 483KiB/s 29391 usec > 2 229KiB/s 196KiB/s 426KiB/s 51625 usec > 4 132KiB/s 88KiB/s 417KiB/s 2313 msec > 8 79KiB/s 18KiB/s 389KiB/s 2298 msec > 16 43KiB/s 9KiB/s 327KiB/s 3905 msec > > Test4: Random Writers > ===================== > [fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=1 to 16] > > IO scheduler: Vanilla CFQ > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > 1 14641KiB/s 14641KiB/s 14641KiB/s 93045 usec > 2 7896KiB/s 1348KiB/s 9245KiB/s 82778 usec > 4 2657KiB/s 265KiB/s 6025KiB/s 216K usec > 8 951KiB/s 122KiB/s 3386KiB/s 1148K usec > 16 66KiB/s 22KiB/s 829KiB/s 1308 msec > > IO scheduler: IO Controller CFQ > > nr Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency > 1 14454KiB/s 14454KiB/s 14454KiB/s 74623 usec > 2 4595KiB/s 4104KiB/s 8699KiB/s 135K usec > 4 3113KiB/s 334KiB/s 5782KiB/s 200K usec > 8 1146KiB/s 95KiB/s 3832KiB/s 593K usec > 16 71KiB/s 29KiB/s 814KiB/s 1457 msec > > Notes: > - Does not look like that anything has changed significantly. > > Previous versions of the patches were posted here. > ------------------------------------------------ > > (V1) http://lkml.org/lkml/2009/3/11/486 > (V2) http://lkml.org/lkml/2009/5/5/275 > (V3) http://lkml.org/lkml/2009/5/26/472 > (V4) http://lkml.org/lkml/2009/6/8/580 > (V5) http://lkml.org/lkml/2009/6/19/279 > (V6) http://lkml.org/lkml/2009/7/2/369 > (V7) http://lkml.org/lkml/2009/7/24/253 > (V8) http://lkml.org/lkml/2009/8/16/204 > (V9) http://lkml.org/lkml/2009/8/28/327 > > Thanks > Vivek ^ permalink raw reply [flat|nested] 349+ messages in thread
* Re: IO scheduler based IO controller V10
  [not found] ` <1253820332-10246-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  2009-09-24 21:33 ` Andrew Morton
@ 2009-09-25  2:20 ` Ulrich Lukas
  2009-09-29  0:37 ` Nauman Rafique
  2 siblings, 0 replies; 349+ messages in thread
From: Ulrich Lukas @ 2009-09-25 2:20 UTC (permalink / raw)
To: Vivek Goyal
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA

Vivek Goyal wrote:
> Notes:
> - With vanilla CFQ, random writers can overwhelm a random reader.
>   Bring down its throughput and bump up latencies significantly.

IIRC, with vanilla CFQ, sequential writing can overwhelm random readers, too.

I'm basing this assumption on the observations I made on both OpenSuse 11.1 and Ubuntu 9.10 alpha6, which I described in my posting on LKML titled "Poor desktop responsiveness with background I/O-operations" of 2009-09-20.
(Message ID: 4AB59CBB.8090907-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org)

Thus, I'm posting this to show that your work is greatly appreciated, given the rather disappointing status quo of Linux's fairness when it comes to disk IO time.

I hope that your efforts lead to a change in performance of current userland applications, the sooner, the better.

Thanks
Ulrich
* Re: IO scheduler based IO controller V10 [not found] ` <1253820332-10246-1-git-send-email-vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-09-24 21:33 ` Andrew Morton 2009-09-25 2:20 ` Ulrich Lukas @ 2009-09-29 0:37 ` Nauman Rafique 2 siblings, 0 replies; 349+ messages in thread From: Nauman Rafique @ 2009-09-29 0:37 UTC (permalink / raw) To: Vivek Goyal Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b Hi Vivek, Me, Divyesh, Fernando and Yoshikawa had a chance to have a chat with Jens about IO controller during Linux Plumbers Conference '09. Jens expressed his concerns about the size and complexity of the patches. I believe that is a reasonable concern. We talked about things that could be done to reduce the size of the patches. The requirement that the "solution has to work with all IO schedulers" seems like a secondary concern at this point; and it came out as one thing that can help to reduce the size of the patch set. Another possibility is to use a simpler scheduling algorithm e.g. weighted round robin, instead of BFQ scheduler. BFQ indeed has great properties, but we cannot deny the fact that it is complex to understand, and might be cumbersome to maintain. Also, hierarchical scheduling is something that could be unnecessary in the first set of patches, even though cgroups are hierarchical in nature. We are starting from a point where there is no cgroup based IO scheduling in the kernel. 
And it is probably not reasonable to satisfy all IO scheduling related requirements in one patch set. We can start with something simple, and build on top of that. So a very simple patch set that enables cgroup based proportional scheduling for CFQ seems like the way to go at this point. It would be great if we discuss our plans on the mailing list, so we can get early feedback from everyone.
* IO scheduler based IO controller V10
@ 2009-09-24 19:25 Vivek Goyal
  0 siblings, 0 replies; 349+ messages in thread
From: Vivek Goyal @ 2009-09-24 19:25 UTC (permalink / raw)
To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, jens.axboe-QHcLZuEGTsvQT0dZR+AlfA
Cc: dhaval-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, dm-devel-H+wXaHxf7aLQT0dZR+AlfA, agk-H+wXaHxf7aLQT0dZR+AlfA, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, paolo.valente-rcYM44yAMweonA0d6jMUrA, jmarchan-H+wXaHxf7aLQT0dZR+AlfA, fernando-gVGce1chcLdL9jVzuh4AOg, jmoyer-H+wXaHxf7aLQT0dZR+AlfA, mingo-X9Un+BFzKDI, riel-H+wXaHxf7aLQT0dZR+AlfA, fchecconi-Re5JQEeQqe8AvxtiuMwx3w, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, righi.andrea-Re5JQEeQqe8AvxtiuMwx3w, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b

Hi All,

Here is the V10 of the IO controller patches generated on top of 2.6.31.

For ease of patching, a consolidated patch is available here.

http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v10.patch

Changes from V9
===============
- Brought back the mechanism of idle trees (cache of recently served io
  queues). BFQ had originally implemented it and I had got rid of it. Later
  I realized that it helps providing fairness when io queue and io groups are
  running at same level. Hence brought the mechanism back.

  This cache helps in determining whether a task getting back into tree
  is a streaming reader who just consumed full slice length or a new process
  (if not in cache) or a random reader who just got a small slice length and
  now got backlogged again.

- Implemented "wait busy" for sequential reader queues. So we wait for one
  extra idle period for these queues to become busy so that group does not
  lose fairness. This works even if group_idle=0.

- Fixed an issue where readers don't preempt writers within a group when
  readers get backlogged. (implemented late preemption).

- Fixed the issue reported by Gui where Anticipatory was not expiring the
  queue.
- Did more modifications to AS so that it lets the common layer know that it
  is anticipating the next request, and the common fair queuing layer does
  not try to do excessive queue expirations.

- Started charging the queue only for allocated slice length (if fairness
  is not set) if it consumed more than allocated slice. Otherwise that queue
  can miss a dispatch round, doubling the max latencies. This idea is also
  borrowed from BFQ.

- Allowed preemption where a reader can preempt other writer running in
  sibling groups or a meta data reader can preempt other non metadata
  reader in sibling group.

- Fixed freed_request() issue pointed out by Nauman.

What problem are we trying to solve
===================================
Provide group IO scheduling feature in Linux along the lines of other resource
controllers like cpu.

IOW, provide facility so that a user can group applications using cgroups and
control the amount of disk time/bandwidth received by a group based on its
weight.

How to solve the problem
=========================

Different people have solved the issue differently. So far it looks like we
have the following two core requirements when it comes to fairness at group
level.

- Control bandwidth seen by groups.
- Control on latencies when a request gets backlogged in group.

At least there are now three patchsets available (including this one).

IO throttling
-------------
This is a bandwidth controller which keeps track of IO rate of a group and
throttles the process in the group if it exceeds the user specified limit.

dm-ioband
---------
This is a proportional bandwidth controller implemented as device mapper
driver and provides fair access in terms of amount of IO done (not in terms
of disk time as CFQ does).

So one will set up one or more dm-ioband devices on top of physical/logical
block device, configure the ioband device and pass information like grouping
etc.
Now this device will keep track of bios flowing through it and control
the flow of bios based on group policies.

IO scheduler based IO controller
--------------------------------
Here we have viewed the problem of IO controller as hierarchical group
scheduling (along the lines of CFS group scheduling) issue. Currently one can
view linux IO schedulers as flat where there is one root group and all the IO
belongs to that group.

This patchset basically modifies IO schedulers to also support hierarchical
group scheduling. CFQ already provides fairness among different processes. I
have extended it to support group IO scheduling. Also took some of the code
out of CFQ and put it in a common layer so that same group scheduling code can
be used by noop, deadline and AS to support group scheduling.

Pros/Cons
=========
There are pros and cons to each of the approaches. Following are some of the
thoughts.

Max bandwidth vs proportional bandwidth
---------------------------------------
IO throttling is a max bandwidth controller and not a proportional one.
Additionally it provides fairness in terms of amount of IO done (and not in
terms of disk time as CFQ does).

Personally, I think that proportional weight controller is useful to more
people than just max bandwidth controller. In addition, IO scheduler based
controller can also be enhanced to do max bandwidth control. So it can
satisfy wider set of requirements.

Fairness in terms of disk time vs size of IO
---------------------------------------------
A higher level controller will most likely be limited to providing fairness
in terms of size/number of IO done and will find it hard to provide fairness
in terms of disk time used (as CFQ provides between various prio levels). This
is because only IO scheduler knows how much disk time a queue has used and
information about queues and disk time used is not exported to higher
layers.
So a seeky application will still run away with a lot of disk time and bring
down the overall throughput of the disk.

Currently dm-ioband provides fairness in terms of number/size of IO.

Latencies and isolation between groups
--------------------------------------
A higher level controller is generally implementing a bandwidth throttling
solution where if a group exceeds either the max bandwidth or the proportional
share then throttle that group.

This kind of approach will probably not help in controlling latencies as it
will depend on underlying IO scheduler. Consider following scenario.

Assume there are two groups. One group is running multiple sequential readers
and other group has a random reader. Sequential readers will get a nice 100ms
slice each and then a random reader from group2 will get to dispatch the
request. So latency of this random reader will depend on how many sequential
readers are running in other group and that is a weak isolation between groups.

When we control things at IO scheduler level, we assign one time slice to one
group and then pick next entity to run. So effectively after one time slice
(max 180ms, if prio 0 sequential reader is running), random reader in other
group will get to run. Hence we achieve better isolation between groups as
response time of process in a different group is generally not dependent on
number of processes running in competing group.

So a higher level solution is most likely limited to only shaping bandwidth
without any control on latencies.

Stacking group scheduler on top of CFQ can lead to issues
---------------------------------------------------------
IO throttling and dm-ioband both are second level controllers. That is, these
controllers are implemented in higher layers than io schedulers. So they
control the IO at higher layer based on group policies and later IO
schedulers take care of dispatching these bios to disk.
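Going back to the two-group latency scenario described above (sequential readers in one group, a single random reader in the other), the difference between the two approaches can be written down as a back-of-the-envelope model. This is only a sketch: it uses the 100ms and 180ms figures from the text, ignores seek time, idling and request service time, and the function names are invented for illustration.

```python
SEQ_SLICE_MS = 100        # slice given to each sequential reader (figure from the text)
MAX_GROUP_SLICE_MS = 180  # longest slice, prio 0 sequential reader (figure from the text)


def wait_flat_ms(n_seq_readers):
    """Higher level control: the random reader waits for every sequential
    reader's slice in turn, so its wait grows with the reader count."""
    return n_seq_readers * SEQ_SLICE_MS


def wait_grouped_ms(n_seq_readers):
    """IO scheduler level control: the random reader's group runs after at
    most one slice of the competing group, regardless of its reader count."""
    return MAX_GROUP_SLICE_MS if n_seq_readers else 0


for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} seq readers: flat ~{wait_flat_ms(n)} ms, grouped <= {wait_grouped_ms(n)} ms")
```

Under these simplifications the flat wait scales linearly with the number of sequential readers in the other group, while the grouped wait is bounded by a single slice, which is the isolation argument made above.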
Implementing a second level controller has the advantage of being able to
provide bandwidth control even on logical block devices in the IO stack
which don't have any IO scheduler attached to them. But they can also
interfere with the IO scheduling policy of the underlying IO scheduler and
change the effective behavior. Following are some of the issues which I think
should be visible in second level controller in one form or other.

Prio within group
-----------------
A second level controller can potentially interfere with behavior of
different prio processes within a group. bios are buffered at higher layer
in single queue and release of bios is FIFO and not proportionate to the
ioprio of the process. This can result in a particular prio level not
getting fair share.

Buffering at higher layer can delay read requests for more than slice idle
period of CFQ (default 8 ms). That means, it is possible that we are waiting
for a request from the queue but it is buffered at higher layer and then idle
timer will fire. It means that queue will lose its share and at the same time
overall throughput will be impacted as we lost those 8 ms.

Read Vs Write
-------------
Writes can overwhelm readers hence second level controller FIFO release
will run into issue here. If there is a single queue maintained then reads
will suffer large latencies. If there are separate queues for reads and writes
then it will be hard to decide in what ratio to dispatch reads and writes as
it is IO scheduler's decision to decide when and how much read/write to
dispatch. This is another place where higher level controller will not be in
sync with lower level io scheduler and can change the effective policies of
underlying io scheduler.

CFQ IO context Issues
---------------------
Buffering at higher layer means submission of bios later with the help of
a worker thread. This changes the io context information at CFQ layer which
assigns the request to submitting thread.
Change of io context info again
leads to issues of idle timer expiry and issue of a process not getting fair
share and reduced throughput.

Throughput with noop, deadline and AS
---------------------------------------------
I think a higher level controller will result in reduced overall throughput
(as compared to io scheduler based io controller) and more seeks with noop,
deadline and AS.

The reason being that it is likely that IO within a group will be related
and will be relatively close as compared to IO across the groups. For example,
thread pool of kvm-qemu doing IO for virtual machine. In case of higher level
control, IO from various groups will go into a single queue at lower level
controller and it might happen that IO is now interleaved (G1, G2, G1, G3,
G4....) causing more seeks and reduced throughput. (Agreed that merging will
help up to some extent but still....).

Instead, in case of lower level controller, IO scheduler maintains one queue
per group hence there is no interleaving of IO between groups. And if IO is
related within group, then we should get reduced number/amount of seeks and
higher throughput.

Latency can be a concern but that can be controlled by reducing the time
slice length of the queue.

Fairness at logical device level vs at physical device level
------------------------------------------------------------

IO scheduler based controller has the limitation that it works only with the
bottom most devices in the IO stack where IO scheduler is attached.

For example, assume a user has created a logical device lv0 using three
underlying disks sda, sdb and sdc. Also assume there are two tasks T1 and T2
in two groups doing IO on lv0. Also assume that weights of groups are in the
ratio of 2:1 so T1 should get double the BW of T2 on lv0 device.

	     T1    T2
	       \   /
	        lv0
	      /  |  \
	   sda  sdb  sdc

Now resource control will take place only on devices sda, sdb and sdc and
not at lv0 level.
So if IO from the two tasks is relatively uniformly distributed across
the disks, then T1 and T2 will see the throughput ratio in proportion
to the weights specified. But if IO from T1 and T2 is going to
different disks and there is no contention, then at the higher level
they both will see the same BW.

Here a second level controller can produce better fairness numbers at
the logical device, but most likely at reduced overall throughput of
the system, because it will try to control IO even if there is no
contention at the physical devices, possibly leaving disks unused in
the system.

Hence, the question is how important it is to control bandwidth at
higher level logical devices also. The actual contention for resources
is at the leaf block device, so it probably makes sense to do any kind
of control there and not at the intermediate devices. Secondly, it
probably also means better use of available resources.

Limited Fairness
----------------
Currently CFQ idles on a sequential reader queue to make sure it gets
its fair share. A second level controller will find it tricky to
anticipate. Either it will not have any anticipation logic, and in that
case it will not provide fairness to single readers in a group (as
dm-ioband does), or if it starts anticipating then we will run into
strange situations where the second level controller is anticipating on
one queue/group and the underlying IO scheduler might be anticipating
on something else.

Need of device mapper tools
---------------------------
A device mapper based solution will require creation of an ioband
device on each physical/logical device one wants to control. So it
requires usage of device mapper tools even for the people who are not
using device mapper. At the same time, creation of an ioband device on
each partition in the system to control the IO can be cumbersome and
overwhelming if the system has got lots of disks and partitions with-in.
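To put rough numbers on the lv0 example from the "Fairness at logical
device level vs at physical device level" section, here is a small
sketch. It is only an illustration under assumptions that are not in
the patches: each disk delivers a made-up 90 MB/s, every task keeps
each disk it uses fully backlogged, and the helper name lv0_bandwidth
is hypothetical.

```python
def lv0_bandwidth(disk_bw, disks_used, weights):
    """Split each disk's bandwidth among the tasks using it, by weight.

    Assumption (not from the patches): a task saturates every disk it
    touches, so a disk is shared only when both tasks send IO to it.
    """
    totals = {task: 0.0 for task in weights}
    for disk, bw in disk_bw.items():
        users = [t for t in weights if disk in disks_used[t]]
        if not users:
            continue  # nobody issues IO to this disk, it stays idle
        weight_sum = sum(weights[t] for t in users)
        for t in users:
            totals[t] += bw * weights[t] / weight_sum
    return totals


# Hypothetical per-disk bandwidth in MB/s; group weights in the ratio 2:1.
disk_bw = {"sda": 90.0, "sdb": 90.0, "sdc": 90.0}
weights = {"T1": 2, "T2": 1}

# Case 1: IO from both tasks is spread uniformly over all three disks.
uniform = lv0_bandwidth(
    disk_bw,
    {"T1": {"sda", "sdb", "sdc"}, "T2": {"sda", "sdb", "sdc"}},
    weights,
)

# Case 2: T1 and T2 hit different disks, so there is no contention.
disjoint = lv0_bandwidth(disk_bw, {"T1": {"sda"}, "T2": {"sdb"}}, weights)

print("uniform :", uniform)
print("disjoint:", disjoint)
```

In this model the uniform case gives T1 twice the bandwidth of T2 as
seen at lv0, while in the disjoint case both see the same bandwidth.
Forcing 2:1 at lv0 in the disjoint case would mean holding T2 back on a
disk that nobody else is using.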
IMHO, an IO scheduler based IO controller is a reasonable approach to
solve the problem of group bandwidth control, and can do hierarchical
IO scheduling more tightly and efficiently.

But I am all ears to alternative approaches and suggestions on how
things can be done better, and will be glad to implement them.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking etc...
- More testing to make sure there are no regressions in CFQ.

Testing
=======

Environment
===========
A 7200 RPM SATA drive with queue depth of 31. Ext3 filesystem. I am
mostly running fio jobs which have been limited to a 30 second run, and
then monitoring the throughput and latency.

Test1: Random Reader Vs Random Writers
======================================
Launched a random reader and then an increasing number of random
writers to see the effect on random reader BW and max latencies.

[fio --rw=randwrite --bs=64K --size=2G --runtime=30 --direct=1 --ioengine=libaio --iodepth=4 --numjobs=<1 to 32>]
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]

[Vanilla CFQ, No groups]
<--------------random writers-------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   5737KiB/s   5737KiB/s   5737KiB/s   164K usec   503KiB/s    159K usec
2   2055KiB/s   1984KiB/s   4039KiB/s   1459K usec  150KiB/s    170K usec
4   1238KiB/s   932KiB/s    4419KiB/s   4332K usec  153KiB/s    225K usec
8   1059KiB/s   929KiB/s    7901KiB/s   1260K usec  118KiB/s    377K usec
16  604KiB/s    483KiB/s    8519KiB/s   3081K usec  47KiB/s     756K usec
32  367KiB/s    222KiB/s    9643KiB/s   5940K usec  22KiB/s     923K usec

Created two cgroups group1 and group2 of weights 500 each. Launched an
increasing number of random writers in group1 and one random reader in
group2 using fio.

[IO controller CFQ; group_idle=8; group1 weight=500; group2 weight=500]
<--------------random writers(group1)------------>  <-random reader(group2)->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   18115KiB/s  18115KiB/s  18115KiB/s  604K usec   345KiB/s    176K usec
2   3752KiB/s   3676KiB/s   7427KiB/s   4367K usec  402KiB/s    187K usec
4   1951KiB/s   1863KiB/s   7642KiB/s   1989K usec  384KiB/s    181K usec
8   755KiB/s    629KiB/s    5683KiB/s   2133K usec  366KiB/s    319K usec
16  418KiB/s    369KiB/s    6276KiB/s   1323K usec  352KiB/s    287K usec
32  236KiB/s    191KiB/s    6518KiB/s   1910K usec  337KiB/s    273K usec

Also ran the same test with IO controller CFQ in flat mode to see if
there are any major deviations from Vanilla CFQ. Does not look like
any.

[IO controller CFQ; No groups]
<--------------random writers-------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   5696KiB/s   5696KiB/s   5696KiB/s   259K usec   500KiB/s    194K usec
2   2483KiB/s   2197KiB/s   4680KiB/s   887K usec   150KiB/s    159K usec
4   1471KiB/s   1433KiB/s   5817KiB/s   962K usec   126KiB/s    189K usec
8   691KiB/s    580KiB/s    5159KiB/s   2752K usec  197KiB/s    246K usec
16  781KiB/s    698KiB/s    11892KiB/s  943K usec   61KiB/s     529K usec
32  415KiB/s    324KiB/s    12461KiB/s  4614K usec  17KiB/s     737K usec

Notes:
- With vanilla CFQ, random writers can overwhelm a random reader,
  bringing down its throughput and bumping up latencies significantly.

- With IO controller, one can provide isolation to the random reader
  group and maintain a consistent view of bandwidth and latencies.

Test2: Random Reader Vs Sequential Reader
=========================================
Launched a random reader and then an increasing number of sequential
readers to see the effect on BW and latencies of the random reader.

[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16>]
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1]

[Vanilla CFQ, No groups]
<---------------seq readers---------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   23318KiB/s  23318KiB/s  23318KiB/s  55940 usec  36KiB/s     247K usec
2   14732KiB/s  11406KiB/s  26126KiB/s  142K usec   20KiB/s     446K usec
4   9417KiB/s   5169KiB/s   27338KiB/s  404K usec   10KiB/s     993K usec
8   3360KiB/s   3041KiB/s   25850KiB/s  954K usec   60KiB/s     956K usec
16  1888KiB/s   1457KiB/s   26763KiB/s  1871K usec  28KiB/s     1868K usec

Created two cgroups group1 and group2 of weights 500 each. Launched an
increasing number of sequential readers in group1 and one random reader
in group2 using fio.

[IO controller CFQ; group_idle=1; group1 weight=500; group2 weight=500]
<---------------group1--------------------------->  <------group2--------->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   13733KiB/s  13733KiB/s  13733KiB/s  247K usec   330KiB/s    154K usec
2   8553KiB/s   4963KiB/s   13514KiB/s  472K usec   322KiB/s    174K usec
4   5045KiB/s   1367KiB/s   13134KiB/s  947K usec   318KiB/s    178K usec
8   1774KiB/s   1420KiB/s   13035KiB/s  1871K usec  323KiB/s    233K usec
16  959KiB/s    518KiB/s    12691KiB/s  3809K usec  324KiB/s    208K usec

Also ran the same test with IO controller CFQ in flat mode to see if
there are any major deviations from Vanilla CFQ. Does not look like
any.

[IO controller CFQ; No groups]
<---------------seq readers---------------------->  <------random reader-->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   23028KiB/s  23028KiB/s  23028KiB/s  47460 usec  36KiB/s     253K usec
2   14452KiB/s  11176KiB/s  25628KiB/s  145K usec   20KiB/s     447K usec
4   8815KiB/s   5720KiB/s   27121KiB/s  396K usec   10KiB/s     968K usec
8   3335KiB/s   2827KiB/s   24866KiB/s  960K usec   62KiB/s     955K usec
16  1784KiB/s   1311KiB/s   26537KiB/s  1883K usec  26KiB/s     1866K usec

Notes:
- The BW and latencies of the random reader in group2 seem to be stable
  and bounded, and do not get impacted much as the number of sequential
  readers increases in group1. Hence providing good isolation.

- Throughput of sequential readers comes down and latencies go up as
  half of the disk bandwidth (in terms of time) has been reserved for
  the random reader group.

Test3: Sequential Reader Vs Sequential Reader
=============================================
Created two cgroups group1 and group2 of weights 500 and 1000
respectively. Launched an increasing number of sequential readers in
group1 and one sequential reader in group2 using fio, and monitored how
bandwidth is being distributed between the two groups.

The first 5 columns give stats about jobs in group1 and the last two
columns give stats about the job in group2.

<---------------group1--------------------------->  <------group2--------->
nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency Agg-bdwidth Max-latency
1   8970KiB/s   8970KiB/s   8970KiB/s   230K usec   20681KiB/s  124K usec
2   6783KiB/s   3202KiB/s   9984KiB/s   546K usec   19682KiB/s  139K usec
4   4641KiB/s   1029KiB/s   9280KiB/s   1185K usec  19235KiB/s  172K usec
8   1435KiB/s   1079KiB/s   9926KiB/s   2461K usec  19501KiB/s  153K usec
16  764KiB/s    398KiB/s    9395KiB/s   4986K usec  19367KiB/s  172K usec

Note: group2 is getting double the bandwidth of group1 even in the face
of an increasing number of readers in group1.
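
The "double the bandwidth" observation can be checked directly from the
Agg-bdwidth columns of the Test3 table. The short sketch below does
just that arithmetic; the numbers are copied from the table, while the
function name and the dictionary layout are only illustrative. Note
that the weights apportion disk time, so the bandwidth ratio should
only be expected to come out near 2, not exactly 2.

```python
# (nr of group1 readers, group1 Agg-bdwidth, group2 Agg-bdwidth) in KiB/s,
# copied from the Test3 table above.
TEST3_ROWS = [
    (1, 8970, 20681),
    (2, 9984, 19682),
    (4, 9280, 19235),
    (8, 9926, 19501),
    (16, 9395, 19367),
]

# Group weights used in Test3: group1=500, group2=1000.
EXPECTED_RATIO = 1000 / 500


def bandwidth_ratios(rows):
    """Return {nr: group2 aggregate bandwidth / group1 aggregate bandwidth}."""
    return {nr: g2 / g1 for nr, g1, g2 in rows}


ratios = bandwidth_ratios(TEST3_ROWS)
for nr, ratio in ratios.items():
    print(f"nr={nr:2d}  group2/group1 = {ratio:.2f}  (weights say {EXPECTED_RATIO:.1f})")
```

Across all five rows the measured ratio stays between roughly 1.96 and
2.31, independent of how many readers are running in group1.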

Test4 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on the host into
two partitions and gave one partition to each virtual machine. Put the
two virtual machines in two different cgroups of weights 1000 and 500
respectively. The virtual machines created an ext3 file system on the
partitions exported from the host and did buffered writes. The host
sees the writes as synchronous, and the virtual machine with the higher
weight gets double the disk time of the virtual machine with the lower
weight. Used the deadline scheduler in this test case.

Some more details about the configuration are in the documentation
patch.

Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky, and the biggest reason is that
async writes are cached in higher layers (page cache) as well as
possibly in the file system layer (btrfs, xfs etc), and are dispatched
to lower layers not necessarily in a proportional manner.

For example, consider two dd threads reading /dev/zero as the input
file and doing writes of huge files. Very soon we will cross
vm_dirty_ratio and a dd thread will be forced to write out some pages
to disk before more pages can be dirtied. But it is not necessarily the
dirty pages of the same thread that are picked. It can very well pick
the inode of the lower priority dd thread and do some writeout. So
effectively the higher weight dd is doing writeouts of the lower weight
dd's pages and we don't see service differentiation.

IOW, the core problem with buffered write fairness is that the higher
weight thread does not throw enough IO traffic at the IO controller to
keep its queue continuously backlogged. In my testing, there are many
.2 to .8 second intervals where the higher weight queue is empty, and
in that duration the lower weight queue gets lots of work done, giving
the impression that there was no service differentiation.
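
The backlog argument above can be illustrated with a toy model. To be
clear, this is not CFQ and not the IO controller code: it is a generic
virtual-time weighted-fair scheduler, and the queue names, the 2:1
weights, the slot count and the "backlogged one slot in four" pattern
are all assumptions chosen for illustration. A queue returning from
idle gets no credit for the time it was empty.

```python
def simulate(slots, weights, is_backlogged):
    """Serve one request per slot from the backlogged queue with the
    smallest virtual time; charge the served queue 1/weight.

    is_backlogged(queue, slot) -> bool is a hypothetical callback that
    says whether a queue has IO pending in a given slot.
    """
    vtime = {q: 0.0 for q in weights}
    served = {q: 0 for q in weights}
    was_active = {q: False for q in weights}
    clock = 0.0  # virtual time of the most recently served queue
    for slot in range(slots):
        active = [q for q in weights if is_backlogged(q, slot)]
        for q in weights:
            if q in active and not was_active[q]:
                # Queue just became backlogged: no credit for idle time.
                vtime[q] = max(vtime[q], clock)
            was_active[q] = q in active
        if not active:
            continue
        pick = min(active, key=lambda q: vtime[q])
        clock = vtime[pick]
        vtime[pick] += 1.0 / weights[pick]
        served[pick] += 1
    return served


weights = {"high": 2, "low": 1}  # assumed 2:1 weights

# Both queues continuously backlogged: service follows the weights.
always = simulate(3000, weights, lambda q, slot: True)

# Higher weight queue has IO pending only one slot in four.
sparse = simulate(3000, weights, lambda q, slot: q == "low" or slot % 4 == 0)

print("always backlogged:", always)
print("high mostly empty:", sparse)
```

With both queues always backlogged the model serves them 2:1. When the
higher weight queue is empty most of the time, the lower weight queue
ends up with most of the service even though the scheduler never
violated the weights, which matches the behavior described for
buffered writes.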

In summary, from the IO controller point of view async write support is
there. Because the page cache has not been designed in such a manner
that a higher prio/weight writer can do more writeout as compared to a
lower prio/weight writer, getting service differentiation is hard, and
it is visible in some cases and not visible in others.

Vanilla CFQ Vs IO Controller CFQ
================================
We have not fundamentally changed CFQ, instead enhanced it to also
support hierarchical IO scheduling. In the process invariably there are
small changes here and there as new scenarios come up. Running some
tests here and comparing both the CFQs to see if there is any major
deviation in behavior.

Test1: Sequential Readers
=========================
[fio --rw=read --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16>]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   35499KiB/s  35499KiB/s  35499KiB/s  19195 usec
2   17089KiB/s  13600KiB/s  30690KiB/s  118K usec
4   9165KiB/s   5421KiB/s   29411KiB/s  380K usec
8   3815KiB/s   3423KiB/s   29312KiB/s  830K usec
16  1911KiB/s   1554KiB/s   28921KiB/s  1756K usec

IO scheduler: IO controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   34494KiB/s  34494KiB/s  34494KiB/s  14482 usec
2   16983KiB/s  13632KiB/s  30616KiB/s  123K usec
4   9237KiB/s   5809KiB/s   29631KiB/s  372K usec
8   3901KiB/s   3505KiB/s   29162KiB/s  822K usec
16  1895KiB/s   1653KiB/s   28945KiB/s  1778K usec

Test2: Sequential Writers
=========================
[fio --rw=write --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16>]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   22669KiB/s  22669KiB/s  22669KiB/s  401K usec
2   14760KiB/s  7419KiB/s   22179KiB/s  571K usec
4   5862KiB/s   5746KiB/s   23174KiB/s  444K usec
8   3377KiB/s   2199KiB/s   22427KiB/s  1057K usec
16  2229KiB/s   556KiB/s    20601KiB/s  5099K usec

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   22911KiB/s  22911KiB/s  22911KiB/s  37319 usec
2   11752KiB/s  11632KiB/s  23383KiB/s  245K usec
4   6663KiB/s   5409KiB/s   23207KiB/s  384K usec
8   3161KiB/s   2460KiB/s   22566KiB/s  935K usec
16  1888KiB/s   795KiB/s    21349KiB/s  3009K usec

Test3: Random Readers
=====================
[fio --rw=randread --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16>]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   484KiB/s    484KiB/s    484KiB/s    22596 usec
2   229KiB/s    196KiB/s    425KiB/s    51111 usec
4   119KiB/s    73KiB/s     405KiB/s    2344 msec
8   93KiB/s     23KiB/s     399KiB/s    2246 msec
16  38KiB/s     8KiB/s      328KiB/s    3965 msec

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   483KiB/s    483KiB/s    483KiB/s    29391 usec
2   229KiB/s    196KiB/s    426KiB/s    51625 usec
4   132KiB/s    88KiB/s     417KiB/s    2313 msec
8   79KiB/s     18KiB/s     389KiB/s    2298 msec
16  43KiB/s     9KiB/s      327KiB/s    3905 msec

Test4: Random Writers
=====================
[fio --rw=randwrite --bs=4K --size=2G --runtime=30 --direct=1 --numjobs=<1 to 16>]

IO scheduler: Vanilla CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   14641KiB/s  14641KiB/s  14641KiB/s  93045 usec
2   7896KiB/s   1348KiB/s   9245KiB/s   82778 usec
4   2657KiB/s   265KiB/s    6025KiB/s   216K usec
8   951KiB/s    122KiB/s    3386KiB/s   1148K usec
16  66KiB/s     22KiB/s     829KiB/s    1308 msec

IO scheduler: IO Controller CFQ

nr  Max-bdwidth Min-bdwidth Agg-bdwidth Max-latency
1   14454KiB/s  14454KiB/s  14454KiB/s  74623 usec
2   4595KiB/s   4104KiB/s   8699KiB/s   135K usec
4   3113KiB/s   334KiB/s    5782KiB/s   200K usec
8   1146KiB/s   95KiB/s     3832KiB/s   593K usec
16  71KiB/s     29KiB/s     814KiB/s    1457 msec

Notes:
- Does not look like anything has changed significantly.

Previous versions of the patches were posted here.
--------------------------------------------------
(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279
(V6) http://lkml.org/lkml/2009/7/2/369
(V7) http://lkml.org/lkml/2009/7/24/253
(V8) http://lkml.org/lkml/2009/8/16/204
(V9) http://lkml.org/lkml/2009/8/28/327

Thanks
Vivek
end of thread, other threads:[~2009-10-08 10:23 UTC | newest] Thread overview: 349+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2009-10-02 10:55 IO scheduler based IO controller V10 Corrado Zoccolo 2009-10-02 11:04 ` Jens Axboe [not found] ` <200910021255.27689.czoccolo-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2009-10-02 11:04 ` Jens Axboe 2009-10-02 12:49 ` Vivek Goyal 2009-10-02 12:49 ` Vivek Goyal 2009-10-02 12:49 ` Vivek Goyal [not found] ` <20091002124921.GA4494-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-10-02 15:27 ` Corrado Zoccolo 2009-10-02 15:27 ` Corrado Zoccolo 2009-10-02 15:31 ` Vivek Goyal 2009-10-02 15:31 ` Vivek Goyal [not found] ` <4e5e476b0910020827s23e827b1n847c64e355999d4a-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-02 15:31 ` Vivek Goyal 2009-10-02 15:32 ` Mike Galbraith 2009-10-02 15:32 ` Mike Galbraith 2009-10-02 15:32 ` Mike Galbraith 2009-10-02 15:40 ` Vivek Goyal 2009-10-02 15:40 ` Vivek Goyal [not found] ` <20091002154020.GC4494-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-10-02 16:03 ` Mike Galbraith 2009-10-02 16:50 ` Valdis.Kletnieks-PjAqaU27lzQ 2009-10-02 16:03 ` Mike Galbraith 2009-10-02 16:50 ` Valdis.Kletnieks 2009-10-02 16:50 ` Valdis.Kletnieks [not found] ` <12774.1254502217-+bZmOdGhbsPr6rcHtW+onFJE71vCis6O@public.gmane.org> 2009-10-02 19:58 ` Vivek Goyal 2009-10-02 19:58 ` Vivek Goyal 2009-10-02 19:58 ` Vivek Goyal 2009-10-02 22:14 ` Corrado Zoccolo 2009-10-02 22:14 ` Corrado Zoccolo [not found] ` <4e5e476b0910021514i1b461229t667bed94fd67f140-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-02 22:27 ` Vivek Goyal 2009-10-02 22:27 ` Vivek Goyal 2009-10-02 22:27 ` Vivek Goyal 2009-10-03 12:43 ` Corrado Zoccolo 2009-10-03 13:38 ` Do we support ioprio on SSDs with NCQ (Was: Re: IO scheduler based IO controller V10) Vivek Goyal 2009-10-03 13:38 ` Vivek Goyal 2009-10-04 9:15 ` Corrado Zoccolo 2009-10-04 12:11 ` Vivek Goyal 2009-10-04 12:46 ` Corrado Zoccolo 2009-10-04 
16:20 ` Fabio Checconi [not found] ` <20091004162005.GH4650-f9ZlEuEWxVeACYmtYXMKmw@public.gmane.org> 2009-10-05 21:21 ` Corrado Zoccolo 2009-10-05 21:21 ` Corrado Zoccolo 2009-10-05 21:21 ` Corrado Zoccolo [not found] ` <4e5e476b0910040546h5f77cd1fo3172fe5c229eb579-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-05 15:06 ` Jeff Moyer 2009-10-05 15:06 ` Jeff Moyer 2009-10-05 21:09 ` Corrado Zoccolo 2009-10-05 21:09 ` Corrado Zoccolo [not found] ` <4e5e476b0910051409x33f8365flf32e8e7548d72e79-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-06 8:41 ` Jens Axboe 2009-10-06 8:41 ` Jens Axboe 2009-10-06 8:41 ` Jens Axboe [not found] ` <20091006084120.GJ5216-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-06 9:00 ` Corrado Zoccolo 2009-10-06 9:00 ` Corrado Zoccolo 2009-10-06 9:00 ` Corrado Zoccolo [not found] ` <4e5e476b0910060200i7c028b3fr4c235bf5f18c3aa1-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-06 18:53 ` Jens Axboe 2009-10-06 18:53 ` Jens Axboe 2009-10-06 18:53 ` Jens Axboe [not found] ` <x49my457uef.fsf-RRHT56Q3PSP4kTEheFKJxxDDeQx5vsVwAInAS/Ez/D0@public.gmane.org> 2009-10-05 21:09 ` Corrado Zoccolo 2009-10-06 21:36 ` Vivek Goyal 2009-10-06 21:36 ` Vivek Goyal 2009-10-06 21:36 ` Vivek Goyal [not found] ` <4e5e476b0910030543o776fb505ka0ce38da9d83b33c-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-03 13:38 ` Vivek Goyal [not found] ` <20091002222756.GG4494-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-10-03 12:43 ` IO scheduler based IO controller V10 Corrado Zoccolo [not found] ` <20091002195815.GE4494-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-10-02 22:14 ` Corrado Zoccolo [not found] ` <1254497520.10392.11.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-02 15:40 ` Vivek Goyal -- strict thread matches above, loose matches on Subject: below -- 2009-10-02 10:55 Corrado Zoccolo 2009-09-24 19:25 Vivek Goyal 2009-09-24 21:33 ` Andrew Morton 2009-09-24 21:33 ` Andrew Morton 2009-09-25 1:09 ` KAMEZAWA Hiroyuki [not found] ` 
<20090925100952.55c2dd7a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> 2009-09-25 1:18 ` KAMEZAWA Hiroyuki 2009-09-25 4:14 ` Vivek Goyal 2009-09-25 1:18 ` KAMEZAWA Hiroyuki 2009-09-25 1:18 ` KAMEZAWA Hiroyuki 2009-09-25 5:29 ` Balbir Singh 2009-09-25 7:09 ` Ryo Tsuruta 2009-09-25 7:09 ` Ryo Tsuruta [not found] ` <20090925052911.GK4590-SINUvgVNF2CyUtPGxGje5AC/G2K4zDHf@public.gmane.org> 2009-09-25 7:09 ` Ryo Tsuruta [not found] ` <20090925101821.1de8091a.kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> 2009-09-25 5:29 ` Balbir Singh 2009-09-25 4:14 ` Vivek Goyal 2009-09-25 4:14 ` Vivek Goyal [not found] ` <20090924143315.781cd0ac.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> 2009-09-25 1:09 ` KAMEZAWA Hiroyuki 2009-09-25 5:04 ` Vivek Goyal 2009-09-25 5:04 ` Vivek Goyal 2009-09-25 5:04 ` Vivek Goyal [not found] ` <20090925050429.GB12555-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-09-25 9:07 ` Ryo Tsuruta 2009-09-25 9:07 ` Ryo Tsuruta 2009-09-25 9:07 ` Ryo Tsuruta 2009-09-25 14:33 ` Vivek Goyal 2009-09-25 14:33 ` Vivek Goyal [not found] ` <20090925143337.GA15007-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-09-28 7:30 ` Ryo Tsuruta 2009-09-28 7:30 ` Ryo Tsuruta [not found] ` <20090925.180724.104041942.ryov-jCdQPDEk3idL9jVzuh4AOg@public.gmane.org> 2009-09-25 14:33 ` Vivek Goyal 2009-09-25 15:04 ` Rik van Riel 2009-09-25 15:04 ` Rik van Riel 2009-09-25 15:04 ` Rik van Riel 2009-09-28 7:38 ` Ryo Tsuruta 2009-09-28 7:38 ` Ryo Tsuruta [not found] ` <4ABCDBFF.1020203-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-09-28 7:38 ` Ryo Tsuruta 2009-09-25 2:20 ` Ulrich Lukas [not found] ` <4ABC28DE.7050809-7vBoImwI/YtIVYojq0lqJrNAH6kLmebB@public.gmane.org> 2009-09-25 20:26 ` Vivek Goyal 2009-09-25 20:26 ` Vivek Goyal 2009-09-25 20:26 ` Vivek Goyal 2009-09-26 14:51 ` Mike Galbraith [not found] ` <1253976676.7005.40.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-09-27 6:55 ` Mike Galbraith 2009-09-27 6:55 ` Mike Galbraith [not found] ` 
<1254034500.7933.6.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-09-27 16:42 ` Jens Axboe 2009-09-27 16:42 ` Jens Axboe [not found] ` <20090927164235.GA23126-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-09-27 18:15 ` Mike Galbraith 2009-09-30 19:58 ` Mike Galbraith 2009-09-27 18:15 ` Mike Galbraith 2009-09-28 4:04 ` Mike Galbraith 2009-09-28 5:55 ` Mike Galbraith 2009-09-28 17:48 ` Vivek Goyal 2009-09-28 17:48 ` Vivek Goyal 2009-09-28 18:24 ` Mike Galbraith [not found] ` <20090928174809.GB3643-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-09-28 18:24 ` Mike Galbraith [not found] ` <1254110648.7683.3.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-09-28 5:55 ` Mike Galbraith 2009-09-28 17:48 ` Vivek Goyal [not found] ` <1254075359.7354.66.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-09-28 4:04 ` Mike Galbraith 2009-09-30 19:58 ` Mike Galbraith [not found] ` <1254340730.7695.32.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-09-30 20:05 ` Mike Galbraith 2009-09-30 20:05 ` Mike Galbraith 2009-09-30 20:24 ` Vivek Goyal 2009-09-30 20:24 ` Vivek Goyal [not found] ` <20090930202447.GA28236-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 2009-10-01 7:33 ` Mike Galbraith 2009-10-01 7:33 ` Mike Galbraith [not found] ` <1254382405.7595.9.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-01 18:58 ` Jens Axboe 2009-10-01 18:58 ` Jens Axboe 2009-10-02 6:23 ` Mike Galbraith [not found] ` <1254464628.7158.101.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-02 8:04 ` Jens Axboe 2009-10-02 8:04 ` Jens Axboe 2009-10-02 8:04 ` Jens Axboe 2009-10-02 8:53 ` Mike Galbraith [not found] ` <1254473609.6378.24.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-02 9:00 ` Mike Galbraith 2009-10-02 9:55 ` Jens Axboe 2009-10-02 9:00 ` Mike Galbraith 2009-10-02 9:55 ` Jens Axboe 2009-10-02 12:22 ` Mike Galbraith [not found] ` <20091002095555.GB26962-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 12:22 ` Mike Galbraith [not found] ` 
<20091002080417.GG14918-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 8:53 ` Mike Galbraith 2009-10-02 9:24 ` Ingo Molnar 2009-10-02 9:24 ` Ingo Molnar 2009-10-02 9:24 ` Ingo Molnar 2009-10-02 9:28 ` Jens Axboe 2009-10-02 9:28 ` Jens Axboe [not found] ` <20091002092839.GA26962-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 14:24 ` Linus Torvalds 2009-10-02 14:24 ` Linus Torvalds 2009-10-02 14:45 ` Mike Galbraith 2009-10-02 14:57 ` Jens Axboe [not found] ` <1254494742.7307.37.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-02 14:57 ` Jens Axboe 2009-10-02 14:56 ` Jens Axboe 2009-10-02 14:56 ` Jens Axboe 2009-10-02 15:14 ` Linus Torvalds 2009-10-02 15:14 ` Linus Torvalds 2009-10-02 16:01 ` jim owens 2009-10-02 16:01 ` jim owens 2009-10-02 17:11 ` Jens Axboe 2009-10-02 17:11 ` Jens Axboe 2009-10-02 17:20 ` Ingo Molnar 2009-10-02 17:20 ` Ingo Molnar 2009-10-02 17:25 ` Jens Axboe 2009-10-02 17:25 ` Jens Axboe 2009-10-02 17:28 ` Ingo Molnar 2009-10-02 17:28 ` Ingo Molnar [not found] ` <20091002172842.GA4884-X9Un+BFzKDI@public.gmane.org> 2009-10-02 17:37 ` Jens Axboe 2009-10-02 17:37 ` Jens Axboe [not found] ` <20091002173732.GK31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 17:56 ` Ingo Molnar 2009-10-02 18:13 ` Mike Galbraith 2009-10-02 17:56 ` Ingo Molnar 2009-10-02 17:56 ` Ingo Molnar [not found] ` <20091002175629.GA14860-X9Un+BFzKDI@public.gmane.org> 2009-10-02 18:04 ` Jens Axboe 2009-10-02 18:04 ` Jens Axboe [not found] ` <20091002180437.GL31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 18:22 ` Mike Galbraith 2009-10-02 18:36 ` Theodore Tso 2009-10-02 18:22 ` Mike Galbraith 2009-10-02 18:26 ` Jens Axboe 2009-10-02 18:33 ` Mike Galbraith [not found] ` <20091002182608.GO31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 18:33 ` Mike Galbraith [not found] ` <1254507754.8667.15.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-02 18:26 ` Jens Axboe 2009-10-02 18:36 ` Theodore Tso 2009-10-02 18:45 ` Jens Axboe 2009-10-02 
18:45 ` Jens Axboe 2009-10-02 19:01 ` Ingo Molnar 2009-10-02 19:09 ` Jens Axboe 2009-10-02 19:09 ` Jens Axboe [not found] ` <20091002190110.GA25297-X9Un+BFzKDI@public.gmane.org> 2009-10-02 19:09 ` Jens Axboe [not found] ` <20091002184549.GS31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 19:01 ` Ingo Molnar [not found] ` <20091002183649.GE8161-3s7WtUTddSA@public.gmane.org> 2009-10-02 18:45 ` Jens Axboe 2009-10-02 18:13 ` Mike Galbraith 2009-10-02 18:19 ` Jens Axboe 2009-10-02 18:57 ` Mike Galbraith 2009-10-02 20:47 ` Mike Galbraith [not found] ` <1254509838.8667.30.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-02 20:47 ` Mike Galbraith 2009-10-03 5:48 ` Mike Galbraith 2009-10-03 5:56 ` Mike Galbraith 2009-10-03 7:24 ` Jens Axboe 2009-10-03 9:00 ` Mike Galbraith 2009-10-03 9:12 ` Corrado Zoccolo 2009-10-03 9:12 ` Corrado Zoccolo [not found] ` <4e5e476b0910030212y50f97d97nc2e17c35d855cc63-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-03 13:18 ` Jens Axboe 2009-10-03 13:18 ` Jens Axboe 2009-10-03 13:18 ` Jens Axboe [not found] ` <1254560434.17052.14.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-03 9:12 ` Corrado Zoccolo 2009-10-03 13:17 ` Jens Axboe 2009-10-03 13:17 ` Jens Axboe 2009-10-03 13:17 ` Jens Axboe [not found] ` <20091003072401.GV31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-03 9:00 ` Mike Galbraith 2009-10-03 11:29 ` Vivek Goyal 2009-10-03 11:29 ` Vivek Goyal [not found] ` <1254549378.8299.21.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-03 7:24 ` Jens Axboe 2009-10-03 11:29 ` Vivek Goyal [not found] ` <1254548931.8299.18.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-03 5:56 ` Mike Galbraith 2009-10-03 7:20 ` Ingo Molnar 2009-10-03 7:20 ` Ingo Molnar 2009-10-03 7:20 ` Ingo Molnar [not found] ` <20091003072021.GB21407-X9Un+BFzKDI@public.gmane.org> 2009-10-03 7:25 ` Jens Axboe 2009-10-03 7:25 ` Jens Axboe 2009-10-03 7:25 ` Jens Axboe 2009-10-03 8:53 ` Mike Galbraith [not found] ` 
<20091003072540.GW31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-03 8:53 ` Mike Galbraith 2009-10-03 9:01 ` Corrado Zoccolo 2009-10-03 9:01 ` Corrado Zoccolo [not found] ` <20091002181903.GN31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 18:57 ` Mike Galbraith 2009-10-03 5:48 ` Mike Galbraith [not found] ` <1254507215.8667.7.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-02 18:19 ` Jens Axboe [not found] ` <20091002172554.GJ31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 17:28 ` Ingo Molnar [not found] ` <20091002172046.GA2376-X9Un+BFzKDI@public.gmane.org> 2009-10-02 17:25 ` Jens Axboe [not found] ` <20091002171129.GG31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 17:20 ` Ingo Molnar [not found] ` <alpine.LFD.2.01.0910020811490.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2009-10-02 16:01 ` jim owens 2009-10-02 17:11 ` Jens Axboe 2009-10-02 16:33 ` Ray Lee 2009-10-02 17:13 ` Jens Axboe 2009-10-02 17:13 ` Jens Axboe [not found] ` <2c0942db0910020933l6d312c6ahae0e00619f598b39-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2009-10-02 17:13 ` Jens Axboe [not found] ` <20091002145610.GD31616-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 15:14 ` Linus Torvalds 2009-10-02 16:33 ` Ray Lee [not found] ` <alpine.LFD.2.01.0910020715160.6996-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2009-10-02 14:45 ` Mike Galbraith 2009-10-02 14:56 ` Jens Axboe 2009-10-02 16:22 ` Ingo Molnar 2009-10-02 16:22 ` Ingo Molnar 2009-10-02 16:22 ` Ingo Molnar [not found] ` <20091002092409.GA19529-X9Un+BFzKDI@public.gmane.org> 2009-10-02 9:28 ` Jens Axboe 2009-10-02 9:36 ` Mike Galbraith 2009-10-02 9:36 ` Mike Galbraith 2009-10-02 16:37 ` Ingo Molnar 2009-10-02 16:37 ` Ingo Molnar [not found] ` <1254476214.11022.8.camel-YqMYhexLQo1vAv1Ojkdn7Q@public.gmane.org> 2009-10-02 16:37 ` Ingo Molnar [not found] ` <20091001185816.GU14918-tSWWG44O7X1aa/9Udqfwiw@public.gmane.org> 2009-10-02 6:23 ` Mike Galbraith 2009-10-02 18:08 ` Jens Axboe 