Re: [PATCH 2/5] writeback: dirty position control

* Re: [PATCH 2/5] writeback: dirty position control
       [not found] <CAFdhcLRKvfqBnXCXLwq-Qe1eNAGC-8XJ3BtHpQKzaa3RhHyp6A@mail.gmail.com>
@ 2011-08-17  6:40 ` David Horner
  2011-08-17 12:03   ` Jan Kara
  0 siblings, 1 reply; 169+ messages in thread
From: David Horner @ 2011-08-17  6:40 UTC (permalink / raw)
  To: linux-kernel, fengguang.wu; +Cc: jack

 I noticed a significant typo below (another of those thousand eyes,
thanks to Jan Kara's post that started me looking):

 > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
 > + unsigned long thresh,
...
 > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
 > + * own size, so move the slope over accordingly.
 > + */
 > + if (unlikely(bdi_thresh > thresh))
 > + bdi_thresh = thresh;
 > + /*
 > + * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
 > + */
 > + x = div_u64((u64)bdi_thresh << 16, thresh | 1);

                  ^
 I believe should be

    x = div_u64((u64)bdi_thresh << 16, thresh + 1);

    David Horner

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17  6:40 ` [PATCH 2/5] writeback: dirty position control David Horner
@ 2011-08-17 12:03   ` Jan Kara
  2011-08-17 12:35     ` Wu Fengguang
  0 siblings, 1 reply; 169+ messages in thread
From: Jan Kara @ 2011-08-17 12:03 UTC (permalink / raw)
  To: David Horner; +Cc: linux-kernel, fengguang.wu, jack

On Wed 17-08-11 02:40:19, David Horner wrote:
>  I noticed a significant typo below (another of those thousand eyes,
> thanks to Jan Kara's post that started me looking):
> 
>  > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>  > + unsigned long thresh,
> ...
>  > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
>  > + * own size, so move the slope over accordingly.
>  > + */
>  > + if (unlikely(bdi_thresh > thresh))
>  > + bdi_thresh = thresh;
>  > + /*
>  > + * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
>  > + */
>  > + x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> 
>                   ^
>  I believe should be
> 
>     x = div_u64((u64)bdi_thresh << 16, thresh + 1);
  I've noticed this as well but it's mostly a consistency issue. 'thresh'
is going to be large in practice so there's not much difference between
thresh + 1 and thresh | 1.
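
As a rough user-space illustration of the point above (not the kernel code
itself), the sketch below computes the 16-bit fixed-point ratio with both
divisors. The names div_u64_sketch(), scale_or() and scale_plus() are made
up for this example, and thresh is narrowed to 32 bits to mirror the u32
divisor that div_u64() takes.

```c
#include <stdint.h>

/* Hypothetical user-space stand-in for the kernel's div_u64(u64, u32). */
static uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/*
 * bdi_thresh / thresh as a fixed-point fraction with 16 fractional bits.
 * "thresh | 1" only sets the low bit, so the divisor is never zero.
 */
static uint64_t scale_or(uint32_t bdi_thresh, uint32_t thresh)
{
	return div_u64_sketch((uint64_t)bdi_thresh << 16, thresh | 1);
}

/*
 * Same ratio with "thresh + 1" as the divisor. In this sketch the caller
 * must keep thresh below UINT32_MAX, otherwise the sum wraps to zero.
 */
static uint64_t scale_plus(uint32_t bdi_thresh, uint32_t thresh)
{
	return div_u64_sketch((uint64_t)bdi_thresh << 16, thresh + 1);
}
```

With bdi_thresh = thresh = 100001 the two forms give 65536 and 65535; they
only diverge noticeably for very small values such as thresh = 1.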

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 12:03   ` Jan Kara
@ 2011-08-17 12:35     ` Wu Fengguang
  0 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-17 12:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: David Horner, linux-kernel

On Wed, Aug 17, 2011 at 08:03:56PM +0800, Jan Kara wrote:
> On Wed 17-08-11 02:40:19, David Horner wrote:
> >  I noticed a significant typo below (another of those thousand eyes,
> > thanks to Jan Kara's post that started me looking):
> > 
> >  > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> >  > + unsigned long thresh,
> > ...
> >  > + * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> >  > + * own size, so move the slope over accordingly.
> >  > + */
> >  > + if (unlikely(bdi_thresh > thresh))
> >  > + bdi_thresh = thresh;
> >  > + /*
> >  > + * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> >  > + */
> >  > + x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > 
> >                   ^
> >  I believe should be
> > 
> >     x = div_u64((u64)bdi_thresh << 16, thresh + 1);
>   I've noticed this as well but it's mostly a consistency issue. 'thresh'
> is going to be large in practice so there's not much difference between
> thresh + 1 and thresh | 1.

Right :) Anyway I'll change it to thresh + 1.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-29 13:37                                 ` Wu Fengguang
@ 2011-09-06 12:40                                   ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-09-06 12:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-09-02 at 14:16 +0200, Peter Zijlstra wrote:
> On Mon, 2011-08-29 at 21:37 +0800, Wu Fengguang wrote:
> > > 
> > > Ok so this argument makes sense, is there some formalism to describe
> > > such systems where such things are more evident?
> > 
> > I find the easiest and cleanest way to describe it is,
> > 
> > (1) the below formula
> >                                                           write_bw  
> >     bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> >                                                           dirty_bw
> > is able to yield
> > 
> >     dirty_ratelimit_(i) ~= (write_bw / N)
> > 
> > as long as
> > 
> > - write_bw, dirty_bw and pos_ratio are not changing rapidly
> > - dirty pages are not around @freerun or @limit
> > 
> > Otherwise there will be larger estimation errors.
> > 
> > (2) based on (1), we get
> > 
> >     task_ratelimit ~= (write_bw / N) * pos_ratio
> > 
> > So the pos_ratio feedback is able to drive dirty count to the
> > setpoint, where pos_ratio = 1.
> > 
> > That interpretation based on _real values_ can neatly decouple the two
> > feedback loops :) It makes full use of the fact "the
> > dirty_ratelimit _value_ is independent of pos_ratio except for
> > possible impacts on estimation errors".
> 
> OK, so the 'problem' I have with this is that the whole control thing
> really doesn't care about N. All it does is measure:
> 
>  - dirty rate
>  - writeback rate
> 
> observe:
> 
>  - dirty count; with the independent input of its setpoint
> 
> control:
> 
>  - ratelimit
> 
> so I was looking for a way to describe the interaction between the two
> feedback loops without involving the exact details of what they're
> controlling, but that might just end up being an oxymoron.


Hmm, so per Vivek's argument the system without pos_ratio in the
feedback term isn't convergent. Therefore we should be able to argue
on convergence/stability grounds that this term is indeed needed.

Does the stability proof of a control system need a model of what it's
controlling? I guess I ought to go get a book on this or so.
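
For what it's worth, a toy model is enough to see the effect numerically.
The sketch below is only that: toy_pos_ratio() is a hypothetical linear
feedback curve (not the one in the patch), every constant is invented, and
each task is assumed to dirty at exactly its ratelimit. It starts from the
balanced N=1 state and runs with N=2, updating the ratelimit once per
200ms step with or without the pos_ratio factor.

```c
struct sim_state {
	double ratelimit;	/* stands in for bdi->dirty_ratelimit, MB/s */
	double dirty;		/* dirty page count, MB */
};

/* Hypothetical linear position feedback: 1.0 at the setpoint, clamped to [0, 2]. */
static double toy_pos_ratio(double dirty, double setpoint, double span)
{
	double p = 1.0 - (dirty - setpoint) / span;

	if (p < 0.0)
		p = 0.0;
	if (p > 2.0)
		p = 2.0;
	return p;
}

/* Run the toy loop; use_pos_ratio selects which update formula is applied. */
static struct sim_state simulate(int use_pos_ratio, int steps)
{
	/* All values below are made up for illustration. */
	const double write_bw = 100.0;	/* MB/s */
	const double setpoint = 1000.0;	/* MB */
	const double span = 200.0;	/* MB over which pos_ratio drops by 1 */
	const double dt = 0.2;		/* seconds per update */
	const int ntasks = 2;
	struct sim_state s = { .ratelimit = 100.0, .dirty = setpoint };

	for (int i = 0; i < steps; i++) {
		double p = toy_pos_ratio(s.dirty, setpoint, span);
		double dirty_bw = ntasks * s.ratelimit * p;

		/* Dirty pages grow by whatever is dirtied beyond writeout. */
		s.dirty += (dirty_bw - write_bw) * dt;

		if (dirty_bw > 0.0) {
			s.ratelimit *= write_bw / dirty_bw;
			if (use_pos_ratio)
				s.ratelimit *= p;
		}
	}
	return s;
}
```

In this toy model simulate(1, 100) ends with the dirty count back at the
setpoint and the ratelimit near write_bw / N, while simulate(0, 100)
settles with dirty_bw == write_bw but the dirty count stuck above the
setpoint. That says nothing rigorous about the real system, of course.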




^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-29 13:37                                 ` Wu Fengguang
@ 2011-09-02 12:16                                   ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-09-02 12:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-08-29 at 21:37 +0800, Wu Fengguang wrote:
> > 
> > Ok so this argument makes sense, is there some formalism to describe
> > such systems where such things are more evident?
> 
> I find the easiest and cleanest way to describe it is,
> 
> (1) the below formula
>                                                           write_bw  
>     bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
>                                                           dirty_bw
> is able to yield
> 
>     dirty_ratelimit_(i) ~= (write_bw / N)
> 
> as long as
> 
> - write_bw, dirty_bw and pos_ratio are not changing rapidly
> - dirty pages are not around @freerun or @limit
> 
> Otherwise there will be larger estimation errors.
> 
> (2) based on (1), we get
> 
>     task_ratelimit ~= (write_bw / N) * pos_ratio
> 
> So the pos_ratio feedback is able to drive dirty count to the
> setpoint, where pos_ratio = 1.
> 
> That interpretation based on _real values_ can neatly decouple the two
> feedback loops :) It makes full use of the fact "the
> dirty_ratelimit _value_ is independent of pos_ratio except for
> possible impacts on estimation errors".

OK, so the 'problem' I have with this is that the whole control thing
really doesn't care about N. All it does is measure:

 - dirty rate
 - writeback rate

observe:

 - dirty count; with the independent input of its setpoint

control:

 - ratelimit

so I was looking for a way to describe the interaction between the two
feedback loops without involving the exact details of what they're
controlling, but that might just end up being an oxymoron.



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-29 13:12                               ` Peter Zijlstra
@ 2011-08-29 13:37                                 ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-29 13:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 29, 2011 at 09:12:07PM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-24 at 14:00 -0400, Vivek Goyal wrote:
> > 
> > Ok, I think I am beginning to see your point. Let me just elaborate on
> > the example you gave.
> > 
> > Assume a system is completely balanced and a task is writing at 100MB/s
> > rate.
> > 
> > write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
> > 
> > bdi->dirty_ratelimit = 100MB/s
> > 
> > Now another task starts dirtying the page cache on the same bdi. Number of
> > dirty pages should go up pretty fast and likely position ratio feedback
> > will kick in to reduce the dirtying rate. (rate based feedback does not
> > kick in till next 200ms) and pos_ratio feedback seems to be instantaneous.
> > Assume new pos_ratio is .5
> > 
> > So new throttle rate for both the tasks is 50MB/s.
> > 
> > bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet)
> > task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
> > 
> > Now let's say 200ms have passed and rate-based feedback is reevaluated.
> > 
> >                                                       write_bw  
> > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
> >                                                       dirty_bw
> > 
> > bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
> > 
> > Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2 but 
> > that did not happen. The reason is that there are two feedback control
> > loops and the pos_ratio loop reacts to imbalances much more quickly. Because
> > previous loop has already reacted to the imbalance and reduced the
> > dirtying rate of task, rate based loop does not try to adjust anything
> > and thinks everything is just fine.
> > 
> > Things are fine in the sense that still dirty_rate == write_bw but
> > system is not balanced in terms of number of dirty pages and pos_ratio=.5
> > 
> > So you are trying to make one feedback loop aware of second loop so that
> > if second loop is unbalanced, first loop reacts to that as well and not
> > just look at dirty_rate and write_bw. So refining new balanced rate by
> > pos_ratio helps.
> >                                                       write_bw  
> > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> >                                                       dirty_bw
> > 
> > Now if global dirty pages are imbalanced, balanced rate will still go
> > down despite the fact that dirty_bw == write_bw. This will lead to
> > further reduction in task dirty rate. Which in turn will lead to reduced
> > number of dirty pages and should eventually lead to pos_ratio=1.
> 
> 
> Ok so this argument makes sense, is there some formalism to describe
> such systems where such things are more evident?

I find the easiest and cleanest way to describe it is,

(1) the below formula
                                                          write_bw  
    bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
                                                          dirty_bw
is able to yield

    dirty_ratelimit_(i) ~= (write_bw / N)

as long as

- write_bw, dirty_bw and pos_ratio are not changing rapidly
- dirty pages are not around @freerun or @limit

Otherwise there will be larger estimation errors.

(2) based on (1), we get

    task_ratelimit ~= (write_bw / N) * pos_ratio

So the pos_ratio feedback is able to drive dirty count to the
setpoint, where pos_ratio = 1.

That interpretation based on _real values_ can neatly decouple the two
feedback loops :) It makes full use of the fact "the
dirty_ratelimit _value_ is independent of pos_ratio except for
possible impacts on estimation errors".
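
Spelled out, steps (1) and (2) can be written as the short derivation
below. The single-letter symbols are shorthand introduced here, and the
whole thing rests on one assumption that is not exact in practice: every
one of the N tasks dirties at precisely its task_ratelimit.

```latex
% Shorthand (introduced for this sketch): r_i = bdi->dirty_ratelimit at
% step i, w = write_bw, d_i = dirty_bw, p_i = pos_ratio, N = number of
% dirtying tasks.
% Assumption: each task dirties at exactly task_ratelimit = r_i p_i.
\begin{align*}
d_i &= N \, r_i \, p_i \\
r_{i+1} &= r_i \cdot \frac{w}{d_i} \cdot p_i
         = r_i \cdot \frac{w}{N \, r_i \, p_i} \cdot p_i
         = \frac{w}{N} \\
\text{task\_ratelimit}_{i+1} &= r_{i+1} \, p_{i+1} = \frac{w}{N} \, p_{i+1}
\end{align*}
% Dropping the p_i factor from the update would instead give
% r_{i+1} = w / (N p_i), which equals w/N only when p_i = 1.
```

Under that assumption pos_ratio cancels out of the dirty_ratelimit value,
which is the sense in which the two loops decouple.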

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 18:00                             ` Vivek Goyal
@ 2011-08-29 13:12                               ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-29 13:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-24 at 14:00 -0400, Vivek Goyal wrote:
> 
> Ok, I think I am beginning to see your point. Let me just elaborate on
> the example you gave.
> 
> Assume a system is completely balanced and a task is writing at 100MB/s
> rate.
> 
> write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
> 
> bdi->dirty_ratelimit = 100MB/s
> 
> Now another task starts dirtying the page cache on the same bdi. Number of
> dirty pages should go up pretty fast and likely position ratio feedback
> will kick in to reduce the dirtying rate. (rate based feedback does not
> kick in till next 200ms) and pos_ratio feedback seems to be instantaneous.
> Assume new pos_ratio is .5
> 
> So new throttle rate for both the tasks is 50MB/s.
> 
> bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet)
> task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
> 
> Now let's say 200ms have passed and rate-based feedback is reevaluated.
> 
>                                                       write_bw  
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
>                                                       dirty_bw
> 
> bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
> 
> Ideally bdi->dirty_ratelimit should have now become 50MB/s as N=2 but 
> that did not happen. The reason is that there are two feedback control
> loops and the pos_ratio loop reacts to imbalances much more quickly. Because
> previous loop has already reacted to the imbalance and reduced the
> dirtying rate of task, rate based loop does not try to adjust anything
> and thinks everything is just fine.
> 
> Things are fine in the sense that still dirty_rate == write_bw but
> system is not balanced in terms of number of dirty pages and pos_ratio=.5
> 
> So you are trying to make one feedback loop aware of second loop so that
> if second loop is unbalanced, first loop reacts to that as well and not
> just look at dirty_rate and write_bw. So refining new balanced rate by
> pos_ratio helps.
>                                                       write_bw  
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
>                                                       dirty_bw
> 
> Now if global dirty pages are imbalanced, balanced rate will still go
> down despite the fact that dirty_bw == write_bw. This will lead to
> further reduction in task dirty rate. Which in turn will lead to reduced
> number of dirty pages and should eventually lead to pos_ratio=1.


Ok so this argument makes sense, is there some formalism to describe
such systems where such things are more evident?
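
The arithmetic in the quoted example can be replayed in a few lines. This
is only a sketch in floating point with made-up helper names
(next_ratelimit(), task_ratelimit()); passing pos_ratio = 1.0 models the
update formula without the extra term.

```c
/*
 * One update of the balanced rate:
 *   ratelimit * write_bw / dirty_bw * pos_ratio
 * Call with pos_ratio = 1.0 to get the formula without the pos_ratio term.
 */
static double next_ratelimit(double ratelimit, double write_bw,
			     double dirty_bw, double pos_ratio)
{
	return ratelimit * (write_bw / dirty_bw) * pos_ratio;
}

/* Per-task throttle rate as used in the quoted example. */
static double task_ratelimit(double bdi_ratelimit, double pos_ratio)
{
	return bdi_ratelimit * pos_ratio;
}
```

With the numbers from the example, task_ratelimit(100, 0.5) is 50MB/s, so
two tasks dirty at 100MB/s in total. next_ratelimit(100, 100, 100, 1.0)
then stays at 100MB/s, whereas next_ratelimit(100, 100, 100, 0.5) drops to
the 50MB/s expected for N=2.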



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 13:18                                             ` Peter Zijlstra
@ 2011-08-26 13:24                                               ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-26 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 09:18:21PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 21:13 +0800, Wu Fengguang wrote:
> > We got similar result as in the read disturber case, even though one
> > disturbs N and the other impacts writeout bandwidth.  The original
> > patchset is consistently performing much better :) 
> 
> It does indeed, and I figure on these timescales it makes sense to
> assume N is a constant. Fair enough, thanks!

Thank you! Glad that we finally reached some consensus :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 13:13                                         ` Wu Fengguang
@ 2011-08-26 13:18                                             ` Peter Zijlstra
  0 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-26 13:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 21:13 +0800, Wu Fengguang wrote:
> We got similar results as in the read disturber case, even though one
> disturbs N and the other impacts writeout bandwidth.  The original
> patchset is consistently performing much better :)

It does indeed, and I figure on these timescales it makes sense to
assume N is a constant. Fair enough, thanks!

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 12:20                                         ` Wu Fengguang
  (?)
@ 2011-08-26 13:13                                         ` Wu Fengguang
  2011-08-26 13:18                                             ` Peter Zijlstra
  -1 siblings, 1 reply; 169+ messages in thread
From: Wu Fengguang @ 2011-08-26 13:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 845 bytes --]

On Fri, Aug 26, 2011 at 08:20:57PM +0800, Wu Fengguang wrote:
> On Fri, Aug 26, 2011 at 08:11:50PM +0800, Peter Zijlstra wrote:
> > On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote:
> > > Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
> > > a "disturber" dd read task during roughly 120-130s. 
> > 
> > Ah, but ideally the disturber task should run in bursts of 100ms
> > (<feedback period), otherwise your N is indeed mostly constant.
> 
> Ah yeah, the disturber task should be a dd writer! Then we get
> 
> - 120s: N=1 => N=2
> - 130s: N=2 => N=1

Here they are. The write disturber starts/stops around 150s.

We got similar results as in the read disturber case, even though one
disturbs N and the other impacts writeout bandwidth.  The original
patchset is consistently performing much better :)

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 120914 bytes --]

[-- Attachment #3: balance_dirty_pages-pages_pure-rate-feedback.png --]
[-- Type: image/png, Size: 142966 bytes --]

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 12:11                                       ` Peter Zijlstra
@ 2011-08-26 12:20                                         ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-26 12:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 08:11:50PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote:
> > Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
> > a "disturber" dd read task during roughly 120-130s. 
> 
> Ah, but ideally the disturber task should run in bursts of 100ms
> (<feedback period), otherwise your N is indeed mostly constant.

Ah yeah, the disturber task should be a dd writer! Then we get

- 120s: N=1 => N=2
- 130s: N=2 => N=1

I'll try it right away.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 11:26                                   ` Wu Fengguang
@ 2011-08-26 12:11                                       ` Peter Zijlstra
  0 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-26 12:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 19:26 +0800, Wu Fengguang wrote:
> Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
> a "disturber" dd read task during roughly 120-130s. 

Ah, but ideally the disturber task should run in bursts of 100ms
(<feedback period), otherwise your N is indeed mostly constant.



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 10:04                                   ` Wu Fengguang
  (?)
  (?)
@ 2011-08-26 11:26                                   ` Wu Fengguang
  2011-08-26 12:11                                       ` Peter Zijlstra
  -1 siblings, 1 reply; 169+ messages in thread
From: Wu Fengguang @ 2011-08-26 11:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1633 bytes --]

Peter,

Now I get 3 figures. Test case is: run 1 dd write task for 300s, with
a "disturber" dd read task during roughly 120-130s.

(1) balance_dirty_pages-pages.png

This is the output of the original patchset. Here the "balanced
ratelimit" dots are mostly accurate except when near @freerun or @limit.

(2) balance_dirty_pages-pages_pure-rate-feedback.png

do this change:
  -	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
  +	balanced_dirty_ratelimit = div_u64((u64)dirty_ratelimit * write_bw,
   					   dirty_rate | 1);

Here the "balanced ratelimit" dots go in the opposite direction
compared to "pos ratelimit", which is the expected result discussed
in the other email. Then the system gets stuck in an unbalanced dirty
position.  It slowly moves towards the setpoint thanks to the
dirty_ratelimit update policy: it only updates dirty_ratelimit when
balanced_dirty_ratelimit fluctuates to the same side as
task_ratelimit, which introduces some systematic "errors" in the
right direction ;)

(3) balance_dirty_pages-pages_pure-rate-feedback-without-dirty_ratelimit-update-constraints.png

Further remove the "do conservative bdi->dirty_ratelimit updates"
feature, by replacing its update policy with a direct assignment:

        bdi->dirty_ratelimit = max(balanced_dirty_ratelimit, 1UL);

This is to check if dirty_ratelimit can still go back to the balance
point without the help of the dirty_ratelimit update policy. To my
surprise, dirty_ratelimit jumps to a HUGE singular value and shows no
sign of coming back to normal..
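
For readers following along, the difference between the conservative
update policy in (2) and the direct assignment in (3) can be sketched in
userspace C. This is only an illustration: the function names are
hypothetical, and the choice to move to the nearer of the two values is an
assumption, not the code in the patchset.

```c
/*
 * Hypothetical sketch, not the patch's code: move dirty_ratelimit only
 * when balanced_dirty_ratelimit and task_ratelimit lie on the same side
 * of it, and then only as far as the nearer of the two (an assumption).
 */
unsigned long update_conservative(unsigned long dirty_ratelimit,
				  unsigned long balanced_dirty_ratelimit,
				  unsigned long task_ratelimit)
{
	/* Both estimates are above the current limit: raise it. */
	if (balanced_dirty_ratelimit > dirty_ratelimit &&
	    task_ratelimit > dirty_ratelimit)
		return balanced_dirty_ratelimit < task_ratelimit ?
		       balanced_dirty_ratelimit : task_ratelimit;

	/* Both estimates are below the current limit: lower it. */
	if (balanced_dirty_ratelimit < dirty_ratelimit &&
	    task_ratelimit < dirty_ratelimit)
		return balanced_dirty_ratelimit > task_ratelimit ?
		       balanced_dirty_ratelimit : task_ratelimit;

	/* The two estimates disagree: keep the current limit. */
	return dirty_ratelimit;
}

/* Variant (3): direct assignment, clamped to at least 1. */
unsigned long update_direct(unsigned long balanced_dirty_ratelimit)
{
	return balanced_dirty_ratelimit > 1UL ? balanced_dirty_ratelimit : 1UL;
}
```

With a current limit of 100, estimates of 150 and 120 move the limit to
120, while estimates of 150 and 60 leave it at 100. The direct variant
follows whatever balanced_dirty_ratelimit says.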

In summary, the original patchset shows the best behavior :)

Thanks,
Fengguang

[-- Attachment #2: balance_dirty_pages-pages.png --]
[-- Type: image/png, Size: 75688 bytes --]

[-- Attachment #3: balance_dirty_pages-pages_pure-rate-feedback.png --]
[-- Type: image/png, Size: 83327 bytes --]

[-- Attachment #4: balance_dirty_pages-pages_pure-rate-feedback-without-dirty_ratelimit-update-constraints.png --]
[-- Type: image/png, Size: 63923 bytes --]

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 10:42                                     ` Peter Zijlstra
@ 2011-08-26 10:52                                       ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-26 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 06:42:22PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 18:04 +0800, Wu Fengguang wrote:
> > Sorry I'm now feeling lost...
> 
> hehe welcome to my world ;-)

Yeah, so sorry...

> Seriously though, I appreciate all the effort you put in trying to
> explain things. I feel I do understand things now, although I might not
> completely agree with them quite yet ;-)

Thank you :)

> I'll go read the v9 patch-set you sent out and look at some of the
> details (such as pos_ratio being comprised of both global and bdi
> limits, which so far has been somewhat glossed over).

Hold on please! I'll immediately post a v10 with all the comment updates.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26 10:04                                   ` Wu Fengguang
@ 2011-08-26 10:42                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-26 10:42 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 18:04 +0800, Wu Fengguang wrote:
> Sorry I'm now feeling lost...

hehe welcome to my world ;-)

Seriously though, I appreciate all the effort you put in trying to
explain things. I feel I do understand things now, although I might not
completely agree with them quite yet ;-)

I'll go read the v9 patch-set you sent out and look at some of the
details (such as pos_ratio being comprised of both global and bdi
limits, which so far has been somewhat glossed over).

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  9:04                                 ` Peter Zijlstra
@ 2011-08-26 10:04                                   ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-26 10:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 05:04:29PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 08:18 +0800, Wu Fengguang wrote:
> > On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote:
> > > On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:
> 
> > > > Put (6) into (4), we get
> > > > 
> > > >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> > > >                             = (write_bw / N) * 2
> > > > 
> > > > That means, any position imbalance will lead to balanced_rate
> > > > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > > > always get the right balanced dirty ratelimit value whether or not
> > > > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > > > dirty position control.
> > > > 
> > > > (*) independent as in real values, not the seemingly relations in equation
> > > 
> > > 
> > > The assumption here is that N is a constant.. in the above case
> > > pos_ratio would eventually end up at 1 and things would be good again. I
> > > see your argument about oscillations, but I think you can introduce
> > > similar effects by varying N.
> > 
> > Yeah, it's very possible for N to change over time, in which case
> > balanced_rate will adapt to new N in similar way.
> 
> Gah.. but but but, that gives the same stuff as your (6)+(4). Why won't
> you accept that for pos_ratio but you don't mind for N ?

Sorry I'm now feeling lost...anyway it's convenient to try out the
pure rate feedback. And the test case exactly includes the sudden
change of N.

I'm now running the tests with this trivial patch:

--- linux-next.orig/mm/page-writeback.c	2011-08-26 17:58:01.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-26 17:59:06.000000000 +0800
@@ -800,7 +800,7 @@ static void bdi_update_dirty_ratelimit(s
 	 * the dirty count meet the setpoint, but also where the slope of
 	 * pos_ratio is most flat and hence task_ratelimit is least fluctuated.
 	 */
-	balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
+	balanced_dirty_ratelimit = div_u64((u64)dirty_ratelimit * write_bw,
 					   dirty_rate | 1);
 
 	/*
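
The two variants in the diff above differ only in the first factor. A
small userspace sketch of the integer arithmetic, assuming a stand-in for
the kernel's div_u64() (64-bit dividend, 32-bit divisor); the names
div_u64_sketch and balanced_ratelimit are made up for illustration:

```c
#include <stdint.h>

/* Userspace stand-in for the kernel's div_u64(u64 dividend, u32 divisor). */
static inline uint64_t div_u64_sketch(uint64_t dividend, uint32_t divisor)
{
	return dividend / divisor;
}

/*
 * ratelimit * write_bw / dirty_rate, as in the diff.
 * The cast to 64 bits keeps the product from overflowing on 32-bit
 * unsigned long; "dirty_rate | 1" keeps the divisor from ever being zero.
 */
unsigned long balanced_ratelimit(unsigned long ratelimit,
				 unsigned long write_bw,
				 unsigned long dirty_rate)
{
	return (unsigned long)div_u64_sketch((uint64_t)ratelimit * write_bw,
					     (uint32_t)(dirty_rate | 1));
}
```

Take write_bw = 25600 and four tasks each throttled at task_ratelimit =
6400, so dirty_rate = 25600. Passing task_ratelimit gives 6399, close to
write_bw / N. Passing a dirty_ratelimit of 12800 (the pos_ratio = 0.5
case) gives 12799, so the pure rate feedback stays near where it started.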

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  8:56                                     ` Peter Zijlstra
@ 2011-08-26  9:53                                       ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-26  9:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 04:56:11PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-26 at 09:56 +0800, Wu Fengguang wrote:
> >         /*
> >          * A linear estimation of the "balanced" throttle rate. The theory is,
> >          * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
> >          * dirty_rate will be measured to be (N * task_ratelimit). So the below
> >          * formula will yield the balanced rate limit (write_bw / N).
> >          *
> >          * Note that the expanded form is not a pure rate feedback:
> >          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate)              (1)
> >          * but also takes pos_ratio into account:
> >          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
> >          *
> >          * (1) is not realistic because pos_ratio also takes part in balancing
> >          * the dirty rate.  Consider the state
> >          *      pos_ratio = 0.5                                              (3)
> >          *      rate = 2 * (write_bw / N)                                    (4)
> >          * If (1) is used, it will be stuck in that state! Because each dd will be
> >          * throttled at
> >          *      task_ratelimit = pos_ratio * rate = (write_bw / N)           (5)
> >          * yielding
> >          *      dirty_rate = N * task_ratelimit = write_bw                   (6)
> >          * put (6) into (1) we get
> >          *      rate_(i+1) = rate_(i)                                        (7)
> >          *
> >          * So we end up using (2) to always keep
> >          *      rate_(i+1) ~= (write_bw / N)                                 (8)
> >          * regardless of the value of pos_ratio. As long as (8) is satisfied,
> >          * pos_ratio is able to drive itself to 1.0, which is not only where
> >          * the dirty count meets the setpoint, but also where the slope of
> >          * pos_ratio is most flat and hence task_ratelimit is least fluctuated.
> >          */ 
> 
> I'm still not buying this, it has the massive assumption N is a
> constant, without that assumption you get the same kind of thing you get
> from not adding pos_ratio to the feedback term.

The reasoning in (3)-(7) actually assumes both N and write_bw to
be constant. It's documenting a stuck state..
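
The stuck state in (3)-(7) can be checked numerically. A minimal sketch in
plain userspace C, assuming the simplified model of the quoted comment (N
identical dd tasks, pos_ratio held fixed over one feedback period); the
function names are made up for illustration and are not in the patch:

```c
#include <assert.h>

/*
 * One feedback period under the simplified model: each of the N tasks
 * dirties at task_ratelimit = pos_ratio * rate, so the measured
 * dirty_rate is N * task_ratelimit.
 */

/* Equation (1): pure rate feedback. */
double next_rate_pure(double rate, double pos_ratio,
		      double write_bw, int n_tasks)
{
	assert(n_tasks > 0 && rate > 0.0 && pos_ratio > 0.0);
	double dirty_rate = n_tasks * pos_ratio * rate;
	return rate * (write_bw / dirty_rate);
}

/* Equation (2): the feedback term is also multiplied by pos_ratio. */
double next_rate_with_pos(double rate, double pos_ratio,
			  double write_bw, int n_tasks)
{
	assert(n_tasks > 0 && rate > 0.0 && pos_ratio > 0.0);
	double dirty_rate = n_tasks * pos_ratio * rate;
	return rate * (write_bw / dirty_rate) * pos_ratio;
}
```

Starting from rate = 2 * (write_bw / N) with pos_ratio = 0.5, the first
function returns its input unchanged, which is (7), while the second
returns write_bw / N, which is (8).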

> Also, I've yet to see what harm it does if you leave it out, all
> feedback loops should stabilize just fine.

That's a good question. It should be trivial to try out equation (1)
and see how it works out in practice. Let me collect some figures..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  0:18                               ` Wu Fengguang
@ 2011-08-26  9:04                                 ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-26  9:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 08:18 +0800, Wu Fengguang wrote:
> On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote:
> > On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:

> > > Put (6) into (4), we get
> > > 
> > >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> > >                             = (write_bw / N) * 2
> > > 
> > > That means, any position imbalance will lead to balanced_rate
> > > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > > always get the right balanced dirty ratelimit value whether or not
> > > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > > dirty position control.
> > > 
> > > (*) independent as in real values, not the seemingly relations in equation
> > 
> > 
> > The assumption here is that N is a constant.. in the above case
> > pos_ratio would eventually end up at 1 and things would be good again. I
> > see your argument about oscillations, but I think you can introduce
> > similar effects by varying N.
> 
> Yeah, it's very possible for N to change over time, in which case
> balanced_rate will adapt to new N in similar way.

Gah.. but but but, that gives the same stuff as your (6)+(4). Why won't
you accept that for pos_ratio but you don't mind for N ?



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-26  1:56                                   ` Wu Fengguang
@ 2011-08-26  8:56                                     ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-26  8:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-26 at 09:56 +0800, Wu Fengguang wrote:
>         /*
>          * A linear estimation of the "balanced" throttle rate. The theory is,
>          * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
>          * dirty_rate will be measured to be (N * task_ratelimit). So the below
>          * formula will yield the balanced rate limit (write_bw / N).
>          *
>          * Note that the expanded form is not a pure rate feedback:
>          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate)              (1)
>          * but also takes pos_ratio into account:
>          *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
>          *
>          * (1) is not realistic because pos_ratio also takes part in balancing
>          * the dirty rate.  Consider the state
>          *      pos_ratio = 0.5                                              (3)
>          *      rate = 2 * (write_bw / N)                                    (4)
>          * If (1) is used, it will get stuck in that state! Because each dd will be
>          * throttled at
>          *      task_ratelimit = pos_ratio * rate = (write_bw / N)           (5)
>          * yielding
>          *      dirty_rate = N * task_ratelimit = write_bw                   (6)
>          * put (6) into (1) we get
>          *      rate_(i+1) = rate_(i)                                        (7)
>          *
>          * So we end up using (2) to always keep
>          *      rate_(i+1) ~= (write_bw / N)                                 (8)
>          * regardless of the value of pos_ratio. As long as (8) is satisfied,
>          * pos_ratio is able to drive itself to 1.0, which is not only where
>          * the dirty count meets the setpoint, but also where the slope of
>          * pos_ratio is flattest and hence task_ratelimit fluctuates least.
>          */ 

I'm still not buying this, it has the massive assumption N is a
constant, without that assumption you get the same kind of thing you get
from not adding pos_ratio to the feedback term.

Also, I've yet to see what harm it does if you leave it out, all
feedback loops should stabilize just fine.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-25 22:20                                 ` Vivek Goyal
@ 2011-08-26  1:56                                   ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-26  1:56 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 26, 2011 at 06:20:01AM +0800, Vivek Goyal wrote:
> On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote:
> 
> [..]
> > > So you are trying to make one feedback loop aware of second loop so that
> > > if second loop is unbalanced, first loop reacts to that as well and not
> > > just look at dirty_rate and write_bw. So refining new balanced rate by
> > > pos_ratio helps.
> > >                                                       write_bw
> > > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> > >                                                       dirty_bw
> > > 
> > > Now if global dirty pages are imbalanced, balanced rate will still go
> > > down despite the fact that dirty_bw == write_bw. This will lead to
> > > further reduction in task dirty rate. Which in turn will lead to reduced
> > > number of dirty pages and should eventually lead to pos_ratio=1.
> > 
> > Right, that's a good alternative viewpoint to the below one.
> > 
> >                                                   write_bw
> >   bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
> >                                                   dirty_bw
> > 
> > (1) the periodic rate estimation uses that to refresh the balanced rate every 200ms
> > (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0
> 
> Personally, I found the other representation much easier to understand.
> Once you have come up with the equation
> 
> balance_rate_(i+1) = balance_rate_(i) * write_bw/dirty_bw
> 
> can you please add a few lines of comments explaining why the above
> alone is not sufficient and why we also need to take pos_ratio into
> account to keep the number of dirty pages in check? And then go on to
> 
> balance_rate_(i+1) = balance_rate_(i) * write_bw/dirty_bw * pos_ratio
> 
> This maintains the continuity of the explanation and explains
> why we are deviating from the theory we discussed so far.

Good point. Here is the commented code:

        /*
         * task_ratelimit reflects each dd's dirty rate for the past 200ms.
         */
        task_ratelimit = (u64)dirty_ratelimit *
                                        pos_ratio >> RATELIMIT_CALC_SHIFT;

        /*
         * A linear estimation of the "balanced" throttle rate. The theory is,
         * if there are N dd tasks, each throttled at task_ratelimit, the bdi's
         * dirty_rate will be measured to be (N * task_ratelimit). So the below
         * formula will yield the balanced rate limit (write_bw / N).
         *
         * Note that the expanded form is not a pure rate feedback:
         *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate)              (1)
         * but also takes pos_ratio into account:
         *      rate_(i+1) = rate_(i) * (write_bw / dirty_rate) * pos_ratio  (2)
         *
         * (1) is not realistic because pos_ratio also takes part in balancing
         * the dirty rate.  Consider the state
         *      pos_ratio = 0.5                                              (3)
         *      rate = 2 * (write_bw / N)                                    (4)
         * If (1) is used, it will get stuck in that state! Because each dd will be
         * throttled at
         *      task_ratelimit = pos_ratio * rate = (write_bw / N)           (5)
         * yielding
         *      dirty_rate = N * task_ratelimit = write_bw                   (6)
         * put (6) into (1) we get
         *      rate_(i+1) = rate_(i)                                        (7)
         *
         * So we end up using (2) to always keep
         *      rate_(i+1) ~= (write_bw / N)                                 (8)
         * regardless of the value of pos_ratio. As long as (8) is satisfied,
         * pos_ratio is able to drive itself to 1.0, which is not only where
         * the dirty count meets the setpoint, but also where the slope of
         * pos_ratio is flattest and hence task_ratelimit fluctuates least.
         */
        balanced_dirty_ratelimit = div_u64((u64)task_ratelimit * write_bw,
                                           dirty_rate | 1);
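
The stuck state in (3)-(7) and the one-step recovery under (2) can be
checked with a small standalone sketch. It is not the kernel code: it uses
doubles instead of fixed point, holds pos_ratio fixed for one 200ms period,
and the function names are made up for illustration.

```c
/*
 * Illustrative only: floating-point versions of update rules (1) and (2)
 * from the comment above. The helper names are hypothetical and do not
 * exist in mm/page-writeback.c.
 */

/* Dirty rate measured when nr_tasks tasks each run at pos_ratio * rate. */
static double sketch_dirty_rate(double rate, double pos_ratio, int nr_tasks)
{
	double task_ratelimit = pos_ratio * rate;	/* (5) */

	return nr_tasks * task_ratelimit;		/* (6) */
}

/* (1): pure rate feedback. */
static double sketch_next_rate_pure(double rate, double write_bw,
				    double dirty_rate)
{
	return rate * write_bw / dirty_rate;
}

/* (2): rate feedback that also takes pos_ratio into account. */
static double sketch_next_rate_pos(double rate, double write_bw,
				   double dirty_rate, double pos_ratio)
{
	return rate * write_bw / dirty_rate * pos_ratio;
}
```

With write_bw = 100, N = 4, pos_ratio = 0.5 and rate = 50 (twice the
balanced rate of 25), sketch_dirty_rate() gives 100, rule (1) returns 50
again, and rule (2) returns 25.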

> > 
> > > A related question, though I should have asked you this long back: how does
> > > throttling based on rate help? Why could we not just work with two
> > > pos_ratios, one the global position ratio and the other the bdi position ratio,
> > > and then throttle tasks gradually to achieve smooth throttling behavior?
> > > IOW, what property does rate provide which is not available just by
> > > looking at per bdi dirty pages? Can't we come up with a bdi setpoint and
> > > limit the way you have done for the global setpoint and throttle tasks
> > > accordingly?
> > 
> > Good question. If we have no idea of the balanced rate at all, but
> > still want to limit dirty pages within the range [freerun, limit],
> > all we can do is to throttle the task at eg. 1TB/s at @freerun and
> > 0 at @limit. Then you get a really sharp control line which will make
> > task_ratelimit fluctuate like mad...
> > 
> > So the balanced rate estimation is the key to get smooth task_ratelimit,
> > while pos_ratio is the ultimate guarantee for the dirty pages range.
> 
> Ok, that makes sense. By keeping an estimate of the rate at which the bdi
> can write, our range of throttling goes down, say to 0 to 300MB/s instead
> of 0 to 1TB/s, and that can lead to smoother behavior.

Yeah exactly, and even better, we can make the slope much flatter
around the setpoint to achieve excellent smoothness in stable state :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 16:12                             ` Peter Zijlstra
@ 2011-08-26  0:18                               ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-26  0:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 12:12:58AM +0800, Peter Zijlstra wrote:
> On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:
> > > You somehow directly jump to  
> > > 
> > > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > > 
> > > without explaining why following will not work.
> > > 
> > > 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate
> > 
> > Thanks for asking that, it's probably the root of confusions, so let
> > me answer it standalone.
> > 
> > It's actually pretty simple to explain this equation:
> > 
> >                                                write_bw
> >         balanced_rate = task_ratelimit_200ms * ----------       (1)
> >                                                dirty_rate
> > 
> > If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> > for the past 200ms, we are going to measure the overall bdi dirty rate
> > 
> >         dirty_rate = N * task_ratelimit_200ms                   (2)
> > 
> > put (2) into (1) we get
> > 
> >         balanced_rate = write_bw / N                            (3)
> > 
> > So equation (1) is the right estimation to get the desired target (3).
> > 
> > 
> > As for
> > 
> >                                                   write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
> >                                                   dirty_rate
> > 
> > Let's compare it with the "expanded" form of (1):
> > 
> >                                                               write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
> >                                                               dirty_rate
> > 
> > So the difference lies in pos_ratio.
> > 
> > Believe it or not, it's exactly this apparent use of pos_ratio that
> > makes (5) independent(*) of the position control.
> > 
> > Why? Look at (4), assume the system is in a state
> > 
> > - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> > - dirty position is not balanced, for example pos_ratio = 0.5
> > 
> > balance_dirty_pages() will be rate limiting each task at half the
> > balanced dirty rate, yielding a measured
> > 
> >         dirty_rate = write_bw / 2                               (6)
> > 
> > Put (6) into (4), we get
> > 
> >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> >                             = (write_bw / N) * 2
> > 
> > That means, any position imbalance will lead to balanced_rate
> > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > always get the right balanced dirty ratelimit value whether or not
> > (pos_ratio == 1.0), hence making the rate estimation independent(*) of
> > dirty position control.
> > 
> > (*) independent in the actual values, despite the apparent dependence in the equation
> 
> 
> The assumption here is that N is a constant.. in the above case
> pos_ratio would eventually end up at 1 and things would be good again. I
> see your argument about oscillations, but I think you can introduce
> similar effects by varying N.

Yeah, it's very possible for N to change over time, in which case
balanced_rate will adapt to the new N in a similar way.
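
To make that concrete, here is a minimal sketch of estimate (1) from the
quoted text, using doubles and a made-up helper name
(sketch_balanced_rate() is not a kernel function). It assumes the measured
dirty_rate is exactly N times the ratelimit each task ran at during the
last period.

```c
/*
 * Illustrative only: the estimate
 *	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
 * with dirty_rate modelled as nr_tasks * task_ratelimit_200ms.
 * sketch_balanced_rate() is a hypothetical name, not a kernel function.
 */
static double sketch_balanced_rate(double task_ratelimit_200ms,
				   double write_bw, int nr_tasks)
{
	double dirty_rate = nr_tasks * task_ratelimit_200ms;	/* (2) */

	return task_ratelimit_200ms * write_bw / dirty_rate;	/* (1) */
}
```

With write_bw = 100, tasks that were throttled at 25 (the balanced rate for
N = 4) yield 12.5 once N becomes 8, and the same 12.5 comes out if they had
been throttled at 50 instead.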

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-25  3:19                               ` Wu Fengguang
@ 2011-08-25 22:20                                 ` Vivek Goyal
  -1 siblings, 0 replies; 169+ messages in thread
From: Vivek Goyal @ 2011-08-25 22:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 11:19:34AM +0800, Wu Fengguang wrote:

[..]
> > So you are trying to make one feedback loop aware of second loop so that
> > if second loop is unbalanced, first loop reacts to that as well and not
> > just look at dirty_rate and write_bw. So refining new balanced rate by
> > pos_ratio helps.
> >                                                       write_bw
> > bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> >                                                       dirty_bw
> > 
> > Now if global dirty pages are imbalanced, balanced rate will still go
> > down despite the fact that dirty_bw == write_bw. This will lead to
> > further reduction in task dirty rate. Which in turn will lead to reduced
> > number of dirty pages and should eventually lead to pos_ratio=1.
> 
> Right, that's a good alternative viewpoint to the below one.
> 
>                                                   write_bw
>   bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
>                                                   dirty_bw
> 
> (1) the periodic rate estimation uses that to refresh the balanced rate every 200ms
> (2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0

Personally, I found the other representation much easier to understand.
Once you have come up with the equation

balance_rate_(i+1) = balance_rate_(i) * write_bw/dirty_bw

can you please add a few lines of comments explaining why the above
alone is not sufficient and why we also need to take pos_ratio into
account to keep the number of dirty pages in check? And then go on to

balance_rate_(i+1) = balance_rate_(i) * write_bw/dirty_bw * pos_ratio

This maintains the continuity of the explanation and explains
why we are deviating from the theory we discussed so far.

> 
> > A related question, though I should have asked you this long back: how does
> > throttling based on rate help? Why could we not just work with two
> > pos_ratios, one the global position ratio and the other the bdi position ratio,
> > and then throttle tasks gradually to achieve smooth throttling behavior?
> > IOW, what property does rate provide which is not available just by
> > looking at per bdi dirty pages? Can't we come up with a bdi setpoint and
> > limit the way you have done for the global setpoint and throttle tasks
> > accordingly?
> 
> Good question. If we have no idea of the balanced rate at all, but
> still want to limit dirty pages within the range [freerun, limit],
> all we can do is to throttle the task at eg. 1TB/s at @freerun and
> 0 at @limit. Then you get a really sharp control line which will make
> task_ratelimit fluctuate like mad...
> 
> So the balanced rate estimation is the key to get smooth task_ratelimit,
> while pos_ratio is the ultimate guarantee for the dirty pages range.

Ok, that makes sense. By keeping an estimate of the rate at which the bdi
can write, our range of throttling goes down, say to 0 to 300MB/s instead
of 0 to 1TB/s, and that can lead to smoother behavior.
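
As a rough illustration of that point, assume a purely linear control line
between freerun and limit (a simplification for this sketch, not the actual
pos_ratio curve in the patch). The helper name and the numbers are
hypothetical.

```c
/*
 * Illustrative only: a linear control line that allows max_rate at
 * @freerun and 0 at @limit. The real pos_ratio curve in the patch is
 * not claimed to be linear; sketch_linear_ratelimit() is a made-up name.
 */
static double sketch_linear_ratelimit(double dirty, double freerun,
				      double limit, double max_rate)
{
	if (dirty <= freerun)
		return max_rate;
	if (dirty >= limit)
		return 0.0;
	return max_rate * (limit - dirty) / (limit - freerun);
}
```

With freerun = 1000 and limit = 2000 pages, moving from 1500 to 1510 dirty
pages changes the allowed rate by 10000 MB/s when max_rate is 1000000 MB/s
(1TB/s), but only by 3 MB/s when max_rate is 300 MB/s.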

Thanks
Vivek

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 15:57                         ` Peter Zijlstra
@ 2011-08-25  5:30                           ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-25  5:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 24, 2011 at 11:57:39PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-23 at 22:15 +0800, Wu Fengguang wrote:
> > On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> > > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > > > - not a factor at all for updating balanced_rate (whether or not we do (2))
> > > >   well, in this concept: the balanced_rate formula inherently does not
> > > >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> > > >   based on the ratelimit executed for the past 200ms:
> > > > 
> > > >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> > > 
> > > Ok, this is where it all goes funny..
> > > 
> > > So if you want completely separated feedback loops I would expect
> > 
> > If we call them feedback loops, then it's a series of independent
> > feedback loops of depth 1, because each balanced_rate is a fresh
> > estimation dependent solely on
> > 
> > - writeout bandwidth
> > - N, the number of dd tasks
> > 
> > in the past 200ms.
> > 
> > As long as a CONSTANT ratelimit (whatever value it is) is executed in
> > the past 200ms, we can get the same balanced_rate.
> > 
> >         balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
> > 
> > The resulting balanced_rate is independent of how large the CONSTANT
> > ratelimit is, because if we start with a doubled CONSTANT ratelimit,
> > we'll see a doubled dirty_rate and arrive at the same balanced_rate.
> > 
> > In that manner, balance_rate_(i+1) is not really depending on the
> > value of balance_rate_(i): whatever balance_rate_(i) is, we are going
> > to get the same balance_rate_(i+1) 
> 
> At best this argument says it doesn't matter what we use, making
> balance_rate_i an equally valid choice. However I don't buy this, your
> argument is broken, your CONSTANT_ratelimit breaks feedback but then you
> rely on the iterative form of feedback to finish your argument.
> 
> Consider:
> 
> 	r_(i+1) = r_i * ratio_i
> 
> you say, r_i := C for all i, then by definition ratio_i must be 1 and
> you've got nothing. The only way your conclusion can be right is by
> allowing the proper iteration, otherwise we'll never reach the
> equilibrium.
> 
> Now it is true you can introduce random perturbations in r_i at any
> given point and still end up in equilibrium, such is the power of
> iterative feedback, but that doesn't say you can do away with r_i. 

Sure, there is always an r_i.

Sorry, what I mean by CONSTANT_ratelimit is that it remains CONSTANT
_inside_ every 200ms period. There will be a series of different CONSTANT
values, one per 200ms period, each roughly (r_i * pos_ratio_i).
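
A minimal sketch of why the particular CONSTANT value drops out of the
estimate. It assumes the idealized model used in this thread: N tasks
that each dirty at exactly the executed ratelimit for the whole period.
The function name and the numbers are illustrative only.

```python
def estimate_balanced_rate(task_ratelimit, write_bw, n_tasks):
    """balanced_rate = task_ratelimit * write_bw / dirty_rate, with the
    measured dirty_rate modelled as n_tasks * task_ratelimit."""
    dirty_rate = n_tasks * task_ratelimit
    return task_ratelimit * write_bw / dirty_rate


WRITE_BW = 100.0   # MB/s, hypothetical
N_TASKS = 4

# Very different CONSTANT ratelimits executed during the period...
estimates = [estimate_balanced_rate(c, WRITE_BW, N_TASKS)
             for c in (10.0, 20.0, 400.0)]

# ...all lead to the same estimate, write_bw / N.
print(estimates)
```

What the sketch does not cover is a ratelimit that changes inside the
period, which is where estimation errors come from.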

> > > something like:
> > > 
> > > 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> > > 
> > > The former is a complete feedback loop, expressing the new value in the
> > > old value (*) with bw_ratio as feedback parameter; if we throttled too
> > > much, the dirty_rate will have dropped and the bw_ratio will be <1
> > > causing the balance_rate to drop increasing the dirty_rate, and vice
> > > versa.
> > 
> > In principle, the bw_ratio works that way. However since
> > balance_rate_(i) is not the exact _executed_ ratelimit in
> > balance_dirty_pages().
> 
> This seems to be where your argument goes bad, the actually executed
> ratelimit is not important, the variance introduced by pos_ratio is
> purely for the benefit of the dirty page count. 
> 
> It doesn't matter for the balance_rate. Without pos_ratio, the dirty
> page count would stay stable (ignoring all these oscillations and other
> fun things), and therefore it is the balance_rate we should be using for
> the iterative feedback.

Nope. The dirty page count can always stay stable somewhere (but not
necessarily at setpoint) purely by the pos_ratio feedback, as illustrated
by Vivek's example.

But that's not the balance state we want. Although the pos_ratio
feedback all by itself is strong enough to keep (dirty_rate == write_bw),
the ideal state is to achieve pos_ratio=1 and eliminate its feedback
error as much as possible, so as to get smooth task_ratelimit.

We may take this viewpoint: a "successful" balance_rate should help
keep pos_ratio around 1.0 in the long term.
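
A toy simulation of that viewpoint. Everything here is an assumption
made for illustration: the linear pos_ratio() is a stand-in for the
patch's actual curve, the numbers are made up, and each 200ms period
executes a single CONSTANT ratelimit. It starts with a stale base rate
(estimated for one task) when there are really two tasks.

```python
def pos_ratio(dirty, setpoint, limit):
    """Hypothetical linear control line: 1.0 at setpoint, 0.0 at limit,
    clamped to [0, 2]."""
    ratio = 1.0 + (setpoint - dirty) / (limit - setpoint)
    return max(0.0, min(2.0, ratio))


def simulate(steps, n_tasks=2, write_bw=100.0, base_rate=100.0,
             dirty=150.0, setpoint=150.0, limit=200.0, period=0.2):
    """Run both loops for `steps` periods.
    Returns the final base rate and the pos_ratio seen in each period."""
    history = []
    for _ in range(steps):
        ratio = pos_ratio(dirty, setpoint, limit)
        task_ratelimit = base_rate * ratio         # executed this period
        dirty_rate = n_tasks * task_ratelimit      # measured this period
        dirty += (dirty_rate - write_bw) * period  # dirty pages drift
        if dirty_rate > 0:
            # Rate estimation based on the *executed* ratelimit.
            base_rate = task_ratelimit * write_bw / dirty_rate
        history.append(ratio)
    return base_rate, history


final_rate, ratios = simulate(steps=40)
print(final_rate, ratios[0], ratios[1], ratios[-1])
```

In this model the base rate snaps to write_bw / N after one period, and
from then on pos_ratio decays back towards 1.0 as the dirty pages return
to the setpoint.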

> > > (*) which is the form I expected and why I thought your primary feedback
> > > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
> >  
> > Because the executed ratelimit was rate_(i) * pos_ratio.
> 
> No, because iterative feedback has the form: 
> 
> 	new = old $op $feedback-term
> 

The problem is, the pos_ratio feedback will jump in and prematurely make
$feedback-term = 1, thus rendering the pure rate feedback weak/useless.

> > > Then when you use the balance_rate to actually throttle tasks you apply
> > > your secondary control steering the dirty page count, yielding:
> > > 
> > > 	task_rate = balance_rate * pos_ratio
> > 
> > Right. Note the above formula is not a derived one, 
> 
> Agreed, its not a derived expression but the originator of the dirty
> page count control.
> 
> > but an original
> > one that later leads to pos_ratio showing up in the calculation of
> > balanced_rate.
> 
> That's where I disagree :-)
> 
> > > >   and task_ratelimit_200ms can, as it happens, be estimated from
> > > > 
> > > >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> > > 
> > > >   We may alternatively record every task_ratelimit executed in the
> > > >   past 200ms and average them all to get task_ratelimit_200ms. In this
> > > >   way we take the "superfluous" pos_ratio out of sight :) 
> > > 
> > > Right, so I'm not at all sure that makes sense, its not immediately
> > > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> > > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > > all. 
> > 
> > task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> > by balance_dirty_pages(). So this is an original formula:
> > 
> >         task_ratelimit = balance_rate * pos_ratio
> > 
> > task_ratelimit_200ms is also used as an original data source in
> > 
> >         balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> 
> But that's exactly where you conflate the positional feedback with the
> throughput feedback, the effective ratelimit includes the positional
> feedback so that the dirty page count can move around, but that is
> completely orthogonal to the throughput feedback since the throughput
> thing would leave the dirty count constant (ideal case again).
> 
> That is, yes the iterative feedback still works because you still got
> your primary feedback in place, but the addition of pos_ratio in the
> feedback loop is a pure perturbation and doesn't matter one whit.

The problem is that pure rate feedback is not possible because
pos_ratio also takes part in altering the task rate...

> > Then we try to estimate task_ratelimit_200ms by assuming all tasks
> > have been executing the same CONSTANT ratelimit in
> > balance_dirty_pages(). Hence we get
> > 
> >         task_ratelimit_200ms ~= prev_balance_rate * pos_ratio
> 
> But this just cannot be true (and, as argued above, is completely
> unnecessary). 
> 
> Consider the case where the dirty count is way below the setpoint but
> the base ratelimit is pretty accurate. In that case we would start out
> by creating very low task ratelimits such that the dirty count can

s/low/high/

> increase. Once we match the setpoint we go back to the base ratelimit.
> The average over those 200ms would be <1, but since we're right at the
> setpoint when we do the base ratelimit feedback we pick exactly 1. 

Yeah, that's the kind of error introduced by the CONSTANT ratelimit
assumption, and it could be pretty large on small-memory boxes. Given
that pos_ratio will fluctuate more anyway when memory, and hence the
dirty control scope, is small, such rate estimation errors are tolerable.
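
A small numeric sketch of that error, under made-up numbers: the base
rate is already accurate, but pos_ratio slides from 1.6 down to 1.0
within one 200ms period. The estimate is computed once from the true
average executed ratelimit and once from the pos_ratio sampled at the
end of the period (the CONSTANT assumption).

```python
from statistics import mean


def estimate(task_ratelimit, write_bw, dirty_rate):
    """balanced_rate = task_ratelimit * write_bw / dirty_rate"""
    return task_ratelimit * write_bw / dirty_rate


N_TASKS, WRITE_BW, BASE_RATE = 2, 100.0, 50.0  # base rate == write_bw / N
pos_samples = [1.6, 1.4, 1.2, 1.0]             # hypothetical, within one period

# What the tasks really executed on average, and the dirty rate it produced.
avg_task_ratelimit = BASE_RATE * mean(pos_samples)
dirty_rate = N_TASKS * avg_task_ratelimit

from_average = estimate(avg_task_ratelimit, WRITE_BW, dirty_rate)
from_last_sample = estimate(BASE_RATE * pos_samples[-1], WRITE_BW, dirty_rate)

print(from_average, from_last_sample)
```

With the averaged ratelimit the estimate stays at write_bw / N; with the
end-of-period sample it comes out too low for this one period.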

> Anyway, its completely irrelevant.. :-)

Yeah, that's one step further to discuss all kinds of possible errors
on top of the basic theory :)

> > > >   There is fundamentally no dependency between balanced_rate_(i+1) and
> > > >   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
> > > >   only asks for _whatever_ CONSTANT task ratelimit to be executed for
> > > >   200ms, then it gets the balanced rate from the dirty_rate feedback.
> > > 
> > > How can there not be a relation between balance_rate_(i+1) and
> > > balance_rate_(i) ? 
> > 
> > In this manner: even though balance_rate_(i) is somehow used for
> > calculating balance_rate_(i+1), the latter will evaluate to the same
> > value given whatever balance_rate_(i).
> 
> But only if you allow for the iterative feedback to work, you absolutely
> need that balance_rate_(i), without that its completely broken.

Agreed.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24 18:00                             ` Vivek Goyal
@ 2011-08-25  3:19                               ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-25  3:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 02:00:58AM +0800, Vivek Goyal wrote:
> On Wed, Aug 24, 2011 at 08:12:58AM +0800, Wu Fengguang wrote:
> > > You somehow directly jump to  
> > > 
> > > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > > 
> > > without explaining why following will not work.
> > > 
> > > 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate
> > 
> > Thanks for asking that, it's probably the root of the confusion, so let
> > me answer it standalone.
> > 
> > It's actually pretty simple to explain this equation:
> > 
> >                                                write_bw
> >         balanced_rate = task_ratelimit_200ms * ----------       (1)
> >                                                dirty_rate
> > 
> > If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> > for the past 200ms, we are going to measure the overall bdi dirty rate
> > 
> >         dirty_rate = N * task_ratelimit_200ms                   (2)
> > 
> > put (2) into (1) we get
> > 
> >         balanced_rate = write_bw / N                            (3)
> > 
> > So equation (1) is the right estimation to get the desired target (3).
> > 
> > 
> > As for
> > 
> >                                                   write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
> >                                                   dirty_rate
> > 
> > Let's compare it with the "expanded" form of (1):
> > 
> >                                                               write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
> >                                                               dirty_rate
> > 
> > So the difference lies in pos_ratio.
> > 
> > Believe it or not, it's exactly the apparent use of pos_ratio that
> > makes (5) independent(*) of the position control.
> > 
> > Why? Look at (4), assume the system is in a state
> > 
> > - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> > - dirty position is not balanced, for example pos_ratio = 0.5
> > 
> > balance_dirty_pages() will be rate limiting each task at half the
> > balanced dirty rate, yielding a measured
> > 
> >         dirty_rate = write_bw / 2                               (6)
> > 
> > Put (6) into (4), we get
> > 
> >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> >                             = (write_bw / N) * 2
> > 
> > That means, any position imbalance will lead to balanced_rate
> > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > always get the right balanced dirty ratelimit value whether or not
> > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > dirty position control.
> > 
> > (*) independent as in real values, not the apparent relations in the equations
> 
> Ok, I think I am beginning to see your point. Let me just elaborate on
> the example you gave.

Thank you very much :)

> Assume a system is completely balanced and a task is writing at 100MB/s
> rate.
> 
> write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
> 
> bdi->dirty_ratelimit = 100MB/s
> 
> Now another task starts dirtying the page cache on the same bdi. The number
> of dirty pages should go up pretty fast and the position ratio feedback
> will likely kick in to reduce the dirtying rate. (Rate based feedback does
> not kick in until the next 200ms, whereas pos_ratio feedback seems to be
> instantaneous.)

That's right. There must be some instantaneous feedback to react to
fast workload changes. With pos_ratio providing this capability, the
estimated balanced rate can take time to follow.

Note that pos_ratio by itself is enough to limit dirty pages within
the [freerun, limit] control scope. The cost of a (temporarily) large
error in the balanced rate is that task_ratelimit will fluctuate much
more: pos_ratio will depart from 1.0 (to the point where it can fully
compensate for the rate error) and the dirty pages will approach
@freerun or @limit, where the slope of pos_ratio gets steep.

A correct estimation of the balanced rate serves to drive pos_ratio back
to 1.0, where it has the flattest slope.

> Assume new pos_ratio is .5
> 
> So new throttle rate for both the tasks is 50MB/s.
> 
> bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet)
> task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
> 
> Now let's say 200ms have passed and the rate based feedback is reevaluated.
> 
> 						        write_bw	
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
> 						        dirty_bw
> 
> bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
> 
> Ideally bdi->dirty_ratelimit should now have become 50MB/s as N=2, but
> that did not happen. The reason is that there are two feedback control
> loops and the pos_ratio loop reacts to imbalances much more quickly. Because
> the pos_ratio loop has already reacted to the imbalance and reduced the
> dirtying rate of the tasks, the rate based loop does not try to adjust
> anything and thinks everything is just fine.

That's right.

> Things are fine in the sense that still dirty_rate == write_bw but
> system is not balanced in terms of number of dirty pages and pos_ratio=.5

Yes. The bad thing is, if the above equation (of pure rate feedback)
is used, the system is going to remain in that position-imbalanced
state forever, which is bad for the smoothness of task_ratelimit.

> So you are trying to make one feedback loop aware of second loop so that
> if second loop is unbalanced, first loop reacts to that as well and not
> just look at dirty_rate and write_bw. So refining new balanced rate by
> pos_ratio helps.
> 						      write_bw	
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
> 						      dirty_bw
> 
> Now if global dirty pages are imbalanced, the balanced rate will still go
> down despite the fact that dirty_bw == write_bw. This will lead to a
> further reduction in the task dirty rate, which in turn will lead to a
> reduced number of dirty pages and should eventually lead to pos_ratio=1.

Right, that's a good alternative viewpoint to the below one.

  						  write_bw	
  bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
  						  dirty_bw

(1) the periodic rate estimation uses that to refresh the balanced rate on every 200ms
(2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0
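
Vivek's two-task example can be replayed with a few lines of arithmetic.
This is only a sketch under the idealized model used in this thread
(every task dirties at exactly task_ratelimit); the function names are
illustrative and not taken from the patch.

```python
def measured_dirty_rate(n_tasks, base_rate, pos_ratio):
    """Each task runs at task_ratelimit = base_rate * pos_ratio."""
    return n_tasks * base_rate * pos_ratio


def update_pure_rate(base_rate, write_bw, dirty_rate):
    """Pure rate feedback: new = old * write_bw / dirty_rate."""
    return base_rate * write_bw / dirty_rate


def update_from_task_ratelimit(base_rate, pos_ratio, write_bw, dirty_rate):
    """Feedback based on the executed ratelimit (base_rate * pos_ratio)."""
    return base_rate * pos_ratio * write_bw / dirty_rate


WRITE_BW = 100.0    # MB/s
N_TASKS = 2         # a second task has just started
BASE_RATE = 100.0   # stale bdi->dirty_ratelimit from when N was 1
POS_RATIO = 0.5     # position feedback has already kicked in

dirty_rate = measured_dirty_rate(N_TASKS, BASE_RATE, POS_RATIO)

stuck = update_pure_rate(BASE_RATE, WRITE_BW, dirty_rate)
corrected = update_from_task_ratelimit(BASE_RATE, POS_RATIO, WRITE_BW, dirty_rate)

print(dirty_rate, stuck, corrected)
```

The pure rate update sees dirty_rate == write_bw and leaves the base
rate at 100MB/s, while the update based on the executed ratelimit lands
on write_bw / N = 50MB/s.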

> A related question, though I should have asked you this long ago: how
> does throttling based on rate help? Why could we not just work with two
> pos_ratios, one the global position ratio and the other the bdi position
> ratio, and then throttle tasks gradually to achieve smooth throttling
> behavior? IOW, what property does rate provide which is not available
> just by looking at per-bdi dirty pages? Can't we come up with a bdi
> setpoint and limit the way you have done for the global setpoint and
> throttle tasks accordingly?

Good question. If we have no idea of the balanced rate at all, but
still want to limit dirty pages within the range [freerun, limit],
all we can do is to throttle the task at eg. 1TB/s at @freerun and
0 at @limit. Then you get a really sharp control line which will make
task_ratelimit fluctuate like mad...

So the balanced rate estimation is the key to get smooth task_ratelimit,
while pos_ratio is the ultimate guarantee for the dirty pages range.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-25  3:19                               ` Wu Fengguang
  0 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-25  3:19 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 25, 2011 at 02:00:58AM +0800, Vivek Goyal wrote:
> On Wed, Aug 24, 2011 at 08:12:58AM +0800, Wu Fengguang wrote:
> > > You somehow directly jump to  
> > > 
> > > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > > 
> > > without explaining why following will not work.
> > > 
> > > 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate
> > 
> > Thanks for asking that, it's probably the root of the confusion, so let
> > me answer it standalone.
> > 
> > It's actually pretty simple to explain this equation:
> > 
> >                                                write_bw
> >         balanced_rate = task_ratelimit_200ms * ----------       (1)
> >                                                dirty_rate
> > 
> > If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> > for the past 200ms, we are going to measure the overall bdi dirty rate
> > 
> >         dirty_rate = N * task_ratelimit_200ms                   (2)
> > 
> > put (2) into (1) we get
> > 
> >         balanced_rate = write_bw / N                            (3)
> > 
> > So equation (1) is the right estimation to get the desired target (3).
> > 
> > 
> > As for
> > 
> >                                                   write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
> >                                                   dirty_rate
> > 
> > Let's compare it with the "expanded" form of (1):
> > 
> >                                                               write_bw
> >         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
> >                                                               dirty_rate
> > 
> > So the difference lies in pos_ratio.
> > 
> > Believe it or not, it's exactly the apparent use of pos_ratio that
> > makes (5) independent(*) of the position control.
> > 
> > Why? Look at (4), assume the system is in a state
> > 
> > - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> > - dirty position is not balanced, for example pos_ratio = 0.5
> > 
> > balance_dirty_pages() will be rate limiting each task at half the
> > balanced dirty rate, yielding a measured
> > 
> >         dirty_rate = write_bw / 2                               (6)
> > 
> > Put (6) into (4), we get
> > 
> >         balanced_rate_(i+1) = balanced_rate_(i) * 2
> >                             = (write_bw / N) * 2
> > 
> > That means, any position imbalance will lead to balanced_rate
> > estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> > always get the right balanced dirty ratelimit value whether or not
> > (pos_ratio == 1.0), hence make the rate estimation independent(*) of
> > dirty position control.
> > 
> > (*) independent as in real values, not the apparent relations in the equations
> 
> Ok, I think I am beginning to see your point. Let me just elaborate on
> the example you gave.

Thank you very much :)

> Assume a system is completely balanced and a task is writing at 100MB/s
> rate.
> 
> write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1
> 
> bdi->dirty_ratelimit = 100MB/s
> 
> Now another task starts dirtying the page cache on the same bdi. The number
> of dirty pages should go up pretty fast and the position ratio feedback
> will likely kick in to reduce the dirtying rate. (Rate based feedback does
> not kick in until the next 200ms, whereas pos_ratio feedback seems to be
> instantaneous.)

That's right. There must be some instantaneous feedback to react to
fast workload changes. With pos_ratio providing this capability, the
estimated balanced rate can take time to follow.

Note that pos_ratio by itself is enough to limit dirty pages within
the [freerun, limit] control scope. The cost of a (temporarily) large
error in the balanced rate is that task_ratelimit will fluctuate much
more: pos_ratio will depart from 1.0 (to the point where it can fully
compensate for the rate error) and the dirty pages will approach
@freerun or @limit, where the slope of pos_ratio gets steep.

A correct estimation of the balanced rate serves to drive pos_ratio back
to 1.0, where it has the flattest slope.

> Assume new pos_ratio is .5
> 
> So new throttle rate for both the tasks is 50MB/s.
> 
> bdi->dirty_ratelimit = 100MB/s (a feedback has not kicked in yet)
> task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 *.5 = 50MB/s
> 
> Now let's say 200ms have passed and the rate based feedback is reevaluated.
> 
> 						        write_bw	
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
> 						        dirty_bw
> 
> bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s
> 
> Ideally bdi->dirty_ratelimit should now have become 50MB/s as N=2, but
> that did not happen. The reason is that there are two feedback control
> loops and the pos_ratio loop reacts to imbalances much more quickly. Because
> the pos_ratio loop has already reacted to the imbalance and reduced the
> dirtying rate of the tasks, the rate based loop does not try to adjust
> anything and thinks everything is just fine.

That's right.

> Things are fine in the sense that dirty_rate == write_bw still holds, but
> the system is not balanced in terms of the number of dirty pages, as
> pos_ratio=.5

Yes. The bad thing is, if the above equation (of pure rate feedback)
is used, the system is going to remain in that position-imbalanced
state forever, which is bad for the smoothness of task_ratelimit.

> So you are trying to make one feedback loop aware of the second loop, so
> that if the second loop is unbalanced, the first loop reacts to that as
> well and does not just look at dirty_rate and write_bw. So refining the
> new balanced rate by pos_ratio helps.
> 
>                                                       write_bw
> bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
>                                                       dirty_bw
> 
> Now if global dirty pages are imbalanced, the balanced rate will still go
> down despite the fact that dirty_bw == write_bw. This will lead to a
> further reduction in the task dirty rate, which in turn will lead to a
> reduced number of dirty pages and should eventually lead to pos_ratio=1.

Right, that's a good alternative viewpoint to the one below.

                                                  write_bw
  bdi->dirty_ratelimit_(i+1) = task_ratelimit_i * ---------
                                                  dirty_bw

(1) the periodic rate estimation uses that to refresh the balanced rate every 200ms
(2) as long as the rate estimation is correct, pos_ratio is able to drive itself to 1.0
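
Points (1) and (2) can be illustrated with a toy closed-loop simulation.
This is only a sketch under stated assumptions: pos_ratio() here is a
hypothetical linear control line clamped to [0, 2], not the curve from the
patch; all N tasks are assumed to dirty at exactly task_ratelimit; and the
numbers (100MB/s device, 200MB setpoint, 10ms simulation step) are made up.

```python
def pos_ratio(dirty, setpoint, span):
    """Hypothetical linear control line: 1.0 at the setpoint, clamped to [0, 2]."""
    return max(0.0, min(2.0, 1.0 + (setpoint - dirty) / span))


def simulate(n_tasks=2, write_bw=100.0, start_rate=100.0,
             setpoint=200.0, span=100.0, seconds=10.0, dt=0.01):
    """Return (time, balanced_rate, pos_ratio) sampled at every 200ms update."""
    balanced_rate = start_rate          # stale estimate (right for N=1)
    dirty = setpoint                    # dirty pages in MB, starting balanced
    steps_per_period = round(0.2 / dt)
    sum_task = sum_dirty = 0.0
    history = []
    for step in range(1, round(seconds / dt) + 1):
        # Position feedback acts on every step ("instantaneous").
        task_ratelimit = balanced_rate * pos_ratio(dirty, setpoint, span)
        dirty_rate = n_tasks * task_ratelimit
        dirty += (dirty_rate - write_bw) * dt
        sum_task += task_ratelimit
        sum_dirty += dirty_rate
        if step % steps_per_period == 0:
            if sum_dirty > 0.0:
                # balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
                # (the ratio of the sums equals the ratio of the averages)
                balanced_rate = sum_task * write_bw / sum_dirty
            sum_task = sum_dirty = 0.0
            history.append((step * dt, balanced_rate,
                            pos_ratio(dirty, setpoint, span)))
    return history


history = simulate()
for t, rate, ratio in (history[0], history[4], history[-1]):
    print(f"t={t:.1f}s balanced_rate={rate:.2f} pos_ratio={ratio:.4f}")
```

In this toy model the first 200ms update already yields write_bw / N, and
pos_ratio then drifts back towards 1.0 as the dirty page count returns to
the setpoint.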

> A related question, though I should have asked you this long back: how
> does throttling based on rate help? Why could we not just work with two
> pos_ratios, one the global position ratio and the other the bdi position
> ratio, and then throttle tasks gradually to achieve smooth throttling
> behavior? IOW, what property does rate provide which is not available
> just by looking at per bdi dirty pages? Can't we come up with a bdi
> setpoint and limit the way you have done for the global setpoint, and
> throttle tasks accordingly?

Good question. If we have no idea of the balanced rate at all, but
still want to limit dirty pages within the range [freerun, limit],
all we can do is to throttle the task at e.g. 1TB/s at @freerun and
0 at @limit. Then you get a really sharp control line, which will make
task_ratelimit fluctuate like mad...

So the balanced rate estimation is the key to getting a smooth
task_ratelimit, while pos_ratio is the ultimate guarantee for the dirty
pages range.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24  0:12                           ` Wu Fengguang
@ 2011-08-24 18:00                             ` Vivek Goyal
  -1 siblings, 0 replies; 169+ messages in thread
From: Vivek Goyal @ 2011-08-24 18:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 24, 2011 at 08:12:58AM +0800, Wu Fengguang wrote:
> > You somehow directly jump to  
> > 
> > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > 
> > without explaining why following will not work.
> > 
> > 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate
> 
> Thanks for asking that, it's probably the root of confusions, so let
> me answer it standalone.
> 
> It's actually pretty simple to explain this equation:
> 
>                                                write_bw
>         balanced_rate = task_ratelimit_200ms * ----------       (1)
>                                                dirty_rate
> 
> If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> for the past 200ms, we are going to measure the overall bdi dirty rate
> 
>         dirty_rate = N * task_ratelimit_200ms                   (2)
> 
> put (2) into (1) we get
> 
>         balanced_rate = write_bw / N                            (3)
> 
> So equation (1) is the right estimation to get the desired target (3).
> 
> 
> As for
> 
>                                                   write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
>                                                   dirty_rate
> 
> Let's compare it with the "expanded" form of (1):
> 
>                                                               write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
>                                                               dirty_rate
> 
> So the difference lies in pos_ratio.
> 
> Believe it or not, it's exactly the seeming use of pos_ratio that
> makes (5) independent(*) of the position control.
> 
> Why? Look at (4), assume the system is in a state
> 
> - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> - dirty position is not balanced, for example pos_ratio = 0.5
> 
> balance_dirty_pages() will be rate limiting each task at half the
> balanced dirty rate, yielding a measured
> 
>         dirty_rate = write_bw / 2                               (6)
> 
> Put (6) into (4), we get
> 
>         balanced_rate_(i+1) = balanced_rate_(i) * 2
>                             = (write_bw / N) * 2
> 
> That means, any position imbalance will lead to balanced_rate
> estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> always get the right balanced dirty ratelimit value whether or not
> (pos_ratio == 1.0), hence making the rate estimation independent(*) of
> dirty position control.
> 
> (*) independent as in real values, not the seeming relations in the equation
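
The comparison between (4) and (5) in the quoted text can be checked
numerically. The sketch below assumes the idealized model of (2), where
each of the N tasks dirties at exactly balanced_rate * pos_ratio; the
function names and the sample values (100MB/s, N=4) are hypothetical and
do not come from the patch.

```python
def measured_dirty_rate(n_tasks, balanced_rate, pos_ratio):
    """Idealized (2): all N tasks dirty at the executed task ratelimit."""
    return n_tasks * balanced_rate * pos_ratio


def estimate_eq4(balanced_rate, write_bw, dirty_rate):
    """Equation (4): pure rate feedback, pos_ratio left out."""
    return balanced_rate * write_bw / dirty_rate


def estimate_eq5(balanced_rate, pos_ratio, write_bw, dirty_rate):
    """Equation (5): feedback based on the executed ratelimit."""
    return balanced_rate * pos_ratio * write_bw / dirty_rate


write_bw = 100.0                 # MB/s, assumed
n_tasks = 4                      # assumed
balanced = write_bw / n_tasks    # rate is already balanced

for ratio in (0.25, 0.5, 1.0, 2.0):
    dirty_rate = measured_dirty_rate(n_tasks, balanced, ratio)
    eq4 = estimate_eq4(balanced, write_bw, dirty_rate)
    eq5 = estimate_eq5(balanced, ratio, write_bw, dirty_rate)
    print(f"pos_ratio={ratio}: (4) -> {eq4:.1f}  (5) -> {eq5:.1f}")
```

Under these assumptions the estimate from (4) is off by a factor of
1 / pos_ratio, while the estimate from (5) stays at write_bw / N for every
pos_ratio tried.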

Ok, I think I am beginning to see your point. Let me just elaborate on
the example you gave.

Assume a system is completely balanced and a task is writing at 100MB/s
rate.

write_bw = dirty_rate = 100MB/s, pos_ratio = 1; N=1

bdi->dirty_ratelimit = 100MB/s

Now another task starts dirtying the page cache on the same bdi. The
number of dirty pages should go up pretty fast, and the position ratio
feedback will likely kick in to reduce the dirtying rate (rate based
feedback does not kick in until the next 200ms, whereas pos_ratio
feedback seems to be instantaneous). Assume the new pos_ratio is .5

So new throttle rate for both the tasks is 50MB/s.

bdi->dirty_ratelimit = 100MB/s (rate feedback has not kicked in yet)
task_ratelimit = bdi->dirty_ratelimit * pos_ratio = 100 * .5 = 50MB/s

Now let's say 200ms have passed and the rate based feedback is reevaluated.

                                                      write_bw
bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * ---------
                                                      dirty_bw

bdi->dirty_ratelimit_(i+1) = 100 * 100/100 = 100MB/s

Ideally bdi->dirty_ratelimit should now have become 50MB/s as N=2, but
that did not happen. The reason is that there are two feedback control
loops, and the pos_ratio loop reacts to imbalances much more quickly.
Because that loop has already reacted to the imbalance and reduced the
dirtying rate of the tasks, the rate based loop does not try to adjust
anything and thinks everything is just fine.

Things are fine in the sense that dirty_rate == write_bw still holds, but
the system is not balanced in terms of the number of dirty pages, as
pos_ratio=.5

So you are trying to make one feedback loop aware of the second loop, so
that if the second loop is unbalanced, the first loop reacts to that as
well and does not just look at dirty_rate and write_bw. So refining the
new balanced rate by pos_ratio helps.

                                                      write_bw
bdi->dirty_ratelimit_(i+1) = bdi->dirty_ratelimit_i * --------- * pos_ratio
                                                      dirty_bw

Now if global dirty pages are imbalanced, the balanced rate will still go
down despite the fact that dirty_bw == write_bw. This will lead to a
further reduction in the task dirty rate, which in turn will lead to a
reduced number of dirty pages and should eventually lead to pos_ratio=1.
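
The arithmetic of this walkthrough, written out as a short script. The
values are the ones used in the example (100MB/s, N going from 1 to 2,
pos_ratio = .5); the variable names are only illustrative, and dirty_bw
is assumed to be exactly N * task_ratelimit.

```python
write_bw = 100.0           # MB/s, from the example
dirty_ratelimit = 100.0    # MB/s, balanced while N was 1
n_tasks = 2                # a second task has started
pos_ratio = 0.5            # assumed instantaneous position feedback

# What each task is throttled to before the 200ms update happens.
task_ratelimit = dirty_ratelimit * pos_ratio        # 50 MB/s
dirty_bw = n_tasks * task_ratelimit                 # 100 MB/s

# Pure rate feedback: sees dirty_bw == write_bw and changes nothing.
pure = dirty_ratelimit * write_bw / dirty_bw        # 100 MB/s

# Rate feedback refined by pos_ratio: lands on write_bw / N.
refined = dirty_ratelimit * write_bw / dirty_bw * pos_ratio   # 50 MB/s

print(f"task_ratelimit={task_ratelimit} dirty_bw={dirty_bw}")
print(f"pure={pure} refined={refined}")
```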

A related question, though I should have asked you this long back: how
does throttling based on rate help? Why could we not just work with two
pos_ratios, one the global position ratio and the other the bdi position
ratio, and then throttle tasks gradually to achieve smooth throttling
behavior? IOW, what property does rate provide which is not available
just by looking at per bdi dirty pages? Can't we come up with a bdi
setpoint and limit the way you have done for the global setpoint, and
throttle tasks accordingly?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-24  0:12                           ` Wu Fengguang
@ 2011-08-24 16:12                             ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-24 16:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Vivek Goyal, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Wed, 2011-08-24 at 08:12 +0800, Wu Fengguang wrote:
> > You somehow directly jump to  
> > 
> > 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> > 
> > without explaining why following will not work.
> > 
> > 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate
> 
> Thanks for asking that, it's probably the root of confusions, so let
> me answer it standalone.
> 
> It's actually pretty simple to explain this equation:
> 
>                                                write_bw
>         balanced_rate = task_ratelimit_200ms * ----------       (1)
>                                                dirty_rate
> 
> If there are N dd tasks, each task is throttled at task_ratelimit_200ms
> for the past 200ms, we are going to measure the overall bdi dirty rate
> 
>         dirty_rate = N * task_ratelimit_200ms                   (2)
> 
> put (2) into (1) we get
> 
>         balanced_rate = write_bw / N                            (3)
> 
> So equation (1) is the right estimation to get the desired target (3).
> 
> 
> As for
> 
>                                                   write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
>                                                   dirty_rate
> 
> Let's compare it with the "expanded" form of (1):
> 
>                                                               write_bw
>         balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
>                                                               dirty_rate
> 
> So the difference lies in pos_ratio.
> 
> Believe it or not, it's exactly the seeming use of pos_ratio that
> makes (5) independent(*) of the position control.
> 
> Why? Look at (4), assume the system is in a state
> 
> - dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
> - dirty position is not balanced, for example pos_ratio = 0.5
> 
> balance_dirty_pages() will be rate limiting each task at half the
> balanced dirty rate, yielding a measured
> 
>         dirty_rate = write_bw / 2                               (6)
> 
> Put (6) into (4), we get
> 
>         balanced_rate_(i+1) = balanced_rate_(i) * 2
>                             = (write_bw / N) * 2
> 
> That means, any position imbalance will lead to balanced_rate
> estimation errors if we follow (4). Whereas if (1)/(5) is used, we
> always get the right balanced dirty ratelimit value whether or not
> (pos_ratio == 1.0), hence making the rate estimation independent(*) of
> dirty position control.
> 
> (*) independent as in real values, not the seeming relations in the equation


The assumption here is that N is a constant.. in the above case
pos_ratio would eventually end up at 1 and things would be good again. I
see your argument about oscillations, but I think you can introduce
similar effects by varying N.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 14:15                       ` Wu Fengguang
  (?)
@ 2011-08-24 15:57                         ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-24 15:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-23 at 22:15 +0800, Wu Fengguang wrote:
> On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > > - not a factor at all for updating balanced_rate (whether or not we do (2))
> > >   well, in this concept: the balanced_rate formula inherently does not
> > >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> > >   based on the ratelimit executed for the past 200ms:
> > > 
> > >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> > 
> > Ok, this is where it all goes funny..
> > 
> > So if you want completely separated feedback loops I would expect
> 
> If call it feedback loops, then it's a series of independent feedback
> loops of depth 1.  Because each balanced_rate is a fresh estimation
> dependent solely on
> 
> - writeout bandwidth
> - N, the number of dd tasks
> 
> in the past 200ms.
> 
> As long as a CONSTANT ratelimit (whatever value it is) is executed in
> the past 200ms, we can get the same balanced_rate.
> 
>         balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
> 
> The resulted balanced_rate is independent of how large the CONSTANT
> ratelimit is, because if we start with a doubled CONSTANT ratelimit,
> we'll see doubled dirty_rate and result in the same balanced_rate. 
> 
> In that manner, balance_rate_(i+1) is not really depending on the
> value of balance_rate_(i): whatever balance_rate_(i) is, we are going
> to get the same balance_rate_(i+1) 

At best this argument says it doesn't matter what we use, making
balance_rate_i an equally valid choice. However, I don't buy this. Your
argument is broken: your CONSTANT_ratelimit breaks the feedback, but then
you rely on the iterative form of feedback to finish your argument.

Consider:

	r_(i+1) = r_i * ratio_i

If you say r_i := C for all i, then by definition ratio_i must be 1 and
you've got nothing. The only way your conclusion can be right is by
allowing the proper iteration, otherwise we'll never reach the
equilibrium.

Now it is true you can introduce random perturbations in r_i at any
given point and still end up in equilibrium, such is the power of
iterative feedback, but that doesn't say you can do away with r_i. 
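
A minimal sketch of the iteration above, under the idealized assumption
that every one of N tasks dirties at exactly r_i with no pos_ratio
applied, so that dirty_rate = N * r_i. The helper names and the one-off
`kick` perturbation are hypothetical.

```python
def next_rate(rate, write_bw, n_tasks):
    """One step of r_(i+1) = r_i * ratio_i, with ratio_i = write_bw / dirty_rate."""
    dirty_rate = n_tasks * rate      # idealized: every task runs at r_i
    return rate * write_bw / dirty_rate


def iterate(start, write_bw=100.0, n_tasks=4, steps=6, kick_at=3, kick=3.0):
    """Iterate the feedback, multiplying r_i by `kick` once as a perturbation."""
    rates = [start]
    for i in range(steps):
        r = rates[-1]
        if i == kick_at:
            r *= kick                # arbitrary disturbance of r_i
        rates.append(next_rate(r, write_bw, n_tasks))
    return rates


print(iterate(10.0))
print(iterate(400.0))
```

In this model the sequence settles at write_bw / N from either starting
value, and the perturbed r_i is absorbed by the next update.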

> > something like:
> > 
> > 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> > 
> > The former is a complete feedback loop, expressing the new value in the
> > old value (*) with bw_ratio as feedback parameter; if we throttled too
> > much, the dirty_rate will have dropped and the bw_ratio will be <1
> > causing the balance_rate to drop increasing the dirty_rate, and vice
> > versa.
> 
> In principle, the bw_ratio works that way. However since
> balance_rate_(i) is not the exact _executed_ ratelimit in
> balance_dirty_pages().

This seems to be where your argument goes bad: the actually executed
ratelimit is not important, and the variance introduced by pos_ratio is
purely for the benefit of the dirty page count.

It doesn't matter for the balance_rate. Without pos_ratio, the dirty
page count would stay stable (ignoring all these oscillations and other
fun things), and therefore it is the balance_rate we should be using for
the iterative feedback.

> > (*) which is the form I expected and why I thought your primary feedback
> > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
>  
> Because the executed ratelimit was rate_(i) * pos_ratio.

No, because iterative feedback has the form: 

	new = old $op $feedback-term

> > Then when you use the balance_rate to actually throttle tasks you apply
> > your secondary control steering the dirty page count, yielding:
> > 
> > 	task_rate = balance_rate * pos_ratio
> 
> Right. Note the above formula is not a derived one, 

Agreed, it's not a derived expression but the originator of the dirty
page count control.

> but an original
> one that later leads to pos_ratio showing up in the calculation of
> balanced_rate.

That's where I disagree :-)

> > >   and task_ratelimit_200ms happen to can be estimated from
> > > 
> > >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> > 
> > >   We may alternatively record every task_ratelimit executed in the
> > >   past 200ms and average them all to get task_ratelimit_200ms. In this
> > >   way we take the "superfluous" pos_ratio out of sight :) 
> > 
> > Right, so I'm not at all sure that makes sense, its not immediately
> > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > all. 
> 
> task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> by balance_dirty_pages(). So this is an original formula:
> 
>         task_ratelimit = balance_rate * pos_ratio
> 
> task_ratelimit_200ms is also used as an original data source in
> 
>         balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

But that's exactly where you conflate the positional feedback with the
throughput feedback. The effective ratelimit includes the positional
feedback so that the dirty page count can move around, but that is
completely orthogonal to the throughput feedback, since the throughput
feedback on its own would leave the dirty count constant (ideal case again).

That is, yes, the iterative feedback still works because you've still got
your primary feedback in place, but the addition of pos_ratio in the
feedback loop is a pure perturbation and doesn't matter one whit.

> Then we try to estimate task_ratelimit_200ms by assuming all tasks
> have been executing the same CONSTANT ratelimit in
> balance_dirty_pages(). Hence we get
> 
>         task_ratelimit_200ms ~= prev_balance_rate * pos_ratio

But this just cannot be true (and, as argued above, is completely
unnecessary). 

Consider the case where the dirty count is way below the setpoint but
the base ratelimit is pretty accurate. In that case we would start out
by creating very low task ratelimits such that the dirty count can
increase. Once we match the setpoint we go back to the base ratelimit.
The average over those 200ms would be <1, but since we're right at the
setpoint when we do the base ratelimit feedback we pick exactly 1. 

Anyway, it's completely irrelevant.. :-)

> > >   There is fundamentally no dependency between balanced_rate_(i+1) and
> > >   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
> > >   only asks for _whatever_ CONSTANT task ratelimit to be executed for
> > >   200ms, then it get the balanced rate from the dirty_rate feedback.
> > 
> > How can there not be a relation between balance_rate_(i+1) and
> > balance_rate_(i) ? 
> 
> In this manner: even though balance_rate_(i) is somehow used for
> calculating balance_rate_(i+1), the latter will evaluate to the same
> value given whatever balance_rate_(i).

But only if you allow for the iterative feedback to work. You absolutely
need that balance_rate_(i); without that it's completely broken.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-24 15:57                         ` Peter Zijlstra
  0 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-24 15:57 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-23 at 22:15 +0800, Wu Fengguang wrote:
> On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > > - not a factor at all for updating balanced_rate (whether or not we do (2))
> > >   well, in this concept: the balanced_rate formula inherently does not
> > >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> > >   based on the ratelimit executed for the past 200ms:
> > > 
> > >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> > 
> > Ok, this is where it all goes funny..
> > 
> > So if you want completely separated feedback loops I would expect
> 
> If call it feedback loops, then it's a series of independent feedback
> loops of depth 1.  Because each balanced_rate is a fresh estimation
> dependent solely on
> 
> - writeout bandwidth
> - N, the number of dd tasks
> 
> in the past 200ms.
> 
> As long as a CONSTANT ratelimit (whatever value it is) is executed in
> the past 200ms, we can get the same balanced_rate.
> 
>         balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
> 
> The resulted balanced_rate is independent of how large the CONSTANT
> ratelimit is, because if we start with a doubled CONSTANT ratelimit,
> we'll see doubled dirty_rate and result in the same balanced_rate. 
> 
> In that manner, balance_rate_(i+1) is not really depending on the
> value of balance_rate_(i): whatever balance_rate_(i) is, we are going
> to get the same balance_rate_(i+1) 

At best this argument says it doesn't matter what we use, making
balance_rate_i an equally valid choice. However, I don't buy this. Your
argument is broken: your CONSTANT_ratelimit breaks the feedback, but then
you rely on the iterative form of feedback to finish your argument.

Consider:

	r_(i+1) = r_i * ratio_i

If you say r_i := C for all i, then by definition ratio_i must be 1 and
you've got nothing. The only way your conclusion can be right is by
allowing the proper iteration, otherwise we'll never reach the
equilibrium.

Now it is true you can introduce random perturbations in r_i at any
given point and still end up in equilibrium, such is the power of
iterative feedback, but that doesn't say you can do away with r_i. 

> > something like:
> > 
> > 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> > 
> > The former is a complete feedback loop, expressing the new value in the
> > old value (*) with bw_ratio as feedback parameter; if we throttled too
> > much, the dirty_rate will have dropped and the bw_ratio will be <1
> > causing the balance_rate to drop increasing the dirty_rate, and vice
> > versa.
> 
> In principle, the bw_ratio works that way. However since
> balance_rate_(i) is not the exact _executed_ ratelimit in
> balance_dirty_pages().

This seems to be where your argument goes bad. The actually executed
ratelimit is not important; the variance introduced by pos_ratio is
purely for the benefit of the dirty page count.

It doesn't matter for the balance_rate. Without pos_ratio, the dirty
page count would stay stable (ignoring all these oscillations and other
fun things), and therefore it is the balance_rate we should be using for
the iterative feedback.

> > (*) which is the form I expected and why I thought your primary feedback
> > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
>  
> Because the executed ratelimit was rate_(i) * pos_ratio.

No, because iterative feedback has the form: 

	new = old $op $feedback-term

> > Then when you use the balance_rate to actually throttle tasks you apply
> > your secondary control steering the dirty page count, yielding:
> > 
> > 	task_rate = balance_rate * pos_ratio
> 
> Right. Note the above formula is not a derived one, 

Agreed, it's not a derived expression but the originator of the dirty
page count control.

> but an original
> one that later leads to pos_ratio showing up in the calculation of
> balanced_rate.

That's where I disagree :-)

> > >   and task_ratelimit_200ms happen to can be estimated from
> > > 
> > >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> > 
> > >   We may alternatively record every task_ratelimit executed in the
> > >   past 200ms and average them all to get task_ratelimit_200ms. In this
> > >   way we take the "superfluous" pos_ratio out of sight :) 
> > 
> > Right, so I'm not at all sure that makes sense, its not immediately
> > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > all. 
> 
> task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> by balance_dirty_pages(). So this is an original formula:
> 
>         task_ratelimit = balance_rate * pos_ratio
> 
> task_ratelimit_200ms is also used as an original data source in
> 
>         balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

But that's exactly where you conflate the positional feedback with the
throughput feedback. The effective ratelimit includes the positional
feedback so that the dirty page count can move around, but that is
completely orthogonal to the throughput feedback, since the throughput
feedback would leave the dirty count constant (ideal case again).

That is, yes, the iterative feedback still works because you've still got
your primary feedback in place, but the addition of pos_ratio in the
feedback loop is a pure perturbation and doesn't matter one whit.

> Then we try to estimate task_ratelimit_200ms by assuming all tasks
> have been executing the same CONSTANT ratelimit in
> balance_dirty_pages(). Hence we get
> 
>         task_ratelimit_200ms ~= prev_balance_rate * pos_ratio

But this just cannot be true (and, as argued above, is completely
unnecessary). 

Consider the case where the dirty count is way below the setpoint but
the base ratelimit is pretty accurate. In that case we would start out
by creating very high task ratelimits (pos_ratio > 1) such that the dirty
count can increase. Once we match the setpoint we go back to the base
ratelimit. The average pos_ratio over those 200ms would be >1, but since
we're right at the setpoint when we do the base ratelimit feedback we
pick exactly 1.

Anyway, it's completely irrelevant.. :-)

> > >   There is fundamentally no dependency between balanced_rate_(i+1) and
> > >   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
> > >   only asks for _whatever_ CONSTANT task ratelimit to be executed for
> > >   200ms, then it get the balanced rate from the dirty_rate feedback.
> > 
> > How can there not be a relation between balance_rate_(i+1) and
> > balance_rate_(i) ? 
> 
> In this manner: even though balance_rate_(i) is somehow used for
> calculating balance_rate_(i+1), the latter will evaluate to the same
> value given whatever balance_rate_(i).

But only if you allow for the iterative feedback to work. You absolutely
need that balance_rate_(i); without that it's completely broken.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 20:24         ` Jan Kara
@ 2011-08-24  3:16           ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-24  3:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh... So maybe you should use bdi_thresh/2 in the computation of
> span?

OK, I'll follow your suggestion to use

        span = 8 * write_bw, for single bdi case 
        span = bdi_thresh, for JBOD case
        x_intercept = setpoint + span;

It does make sense to squeeze the bdi_dirty fluctuation range a bit by
doubling span and making the control line more sharp.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 17:47                         ` Vivek Goyal
@ 2011-08-24  0:12                           ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-24  0:12 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

> You somehow directly jump to  
> 
> 	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> 
> without explaining why following will not work.
> 
> 	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate

Thanks for asking that. It's probably the root of the confusion, so let
me answer it standalone.

It's actually pretty simple to explain this equation:

                                               write_bw
        balanced_rate = task_ratelimit_200ms * ----------       (1)
                                               dirty_rate

If there are N dd tasks, each throttled at task_ratelimit_200ms for the
past 200ms, we will measure an overall bdi dirty rate of

        dirty_rate = N * task_ratelimit_200ms                   (2)

Putting (2) into (1), we get

        balanced_rate = write_bw / N                            (3)

So equation (1) is the right estimation to get the desired target (3).
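
The substitution of (2) into (1) can be checked numerically. This is a
minimal sketch, assuming the idealized relation (2) holds exactly; the
function name and the sample numbers are made up for illustration.

```python
def balanced_rate(task_ratelimit_200ms, write_bw, n_tasks):
    # Equation (2): N tasks, each throttled at task_ratelimit_200ms.
    dirty_rate = n_tasks * task_ratelimit_200ms
    # Equation (1): scale the executed ratelimit by write_bw / dirty_rate.
    return task_ratelimit_200ms * write_bw / dirty_rate


# Whatever ratelimit was executed, the estimate lands on write_bw / N.
for limit in (5.0, 10.0, 40.0):
    print(limit, balanced_rate(limit, write_bw=80.0, n_tasks=4))
```

Each call yields write_bw / N, which is equation (3).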


As for

                                                  write_bw
        balanced_rate_(i+1) = balanced_rate_(i) * ----------    (4)
                                                  dirty_rate

Let's compare it with the "expanded" form of (1):

                                                              write_bw
        balanced_rate_(i+1) = balanced_rate_(i) * pos_ratio * ----------      (5)
                                                              dirty_rate

So the difference lies in pos_ratio.

Believe it or not, it's exactly the seemingly superfluous use of
pos_ratio that makes (5) independent(*) of the position control.

Why? Look at (4) and assume the system is in a state where

- dirty rate is already balanced, ie. balanced_rate_(i) = write_bw / N
- dirty position is not balanced, for example pos_ratio = 0.5

balance_dirty_pages() will be rate limiting each task at half the
balanced dirty rate, yielding a measured

        dirty_rate = write_bw / 2                               (6)

Put (6) into (4), we get

        balanced_rate_(i+1) = balanced_rate_(i) * 2
                            = (write_bw / N) * 2

That means any position imbalance will lead to balanced_rate
estimation errors if we follow (4). Whereas if (1)/(5) is used, we
always get the right balanced dirty ratelimit value whether or not
(pos_ratio == 1.0), hence making the rate estimation independent(*) of
dirty position control.

(*) independent in terms of actual values, not in terms of how the
    equations appear to be related
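
The comparison of (4) and (5) can be replayed with concrete numbers.
This is a minimal sketch under the same idealized assumption as (2) and
(6), namely that each of the N tasks dirties at exactly
balanced_rate * pos_ratio. The function names and the values are
hypothetical.

```python
def measured_dirty_rate(balanced_rate, pos_ratio, n_tasks):
    # Each of the N tasks is throttled at balanced_rate * pos_ratio.
    return n_tasks * balanced_rate * pos_ratio


def update_eq4(balanced_rate, write_bw, dirty_rate):
    # Equation (4): no pos_ratio in the update.
    return balanced_rate * write_bw / dirty_rate


def update_eq5(balanced_rate, pos_ratio, write_bw, dirty_rate):
    # Equation (5): the update uses the executed rate balanced_rate * pos_ratio.
    return balanced_rate * pos_ratio * write_bw / dirty_rate


write_bw, n_tasks, pos_ratio = 80.0, 4, 0.5
balanced = write_bw / n_tasks                      # already balanced
dirty_rate = measured_dirty_rate(balanced, pos_ratio, n_tasks)

print(dirty_rate)                                  # equation (6): write_bw / 2
print(update_eq4(balanced, write_bw, dirty_rate))  # twice the balanced rate
print(update_eq5(balanced, pos_ratio, write_bw, dirty_rate))  # unchanged
```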

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 14:15                       ` Wu Fengguang
@ 2011-08-23 17:47                         ` Vivek Goyal
  -1 siblings, 0 replies; 169+ messages in thread
From: Vivek Goyal @ 2011-08-23 17:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 10:15:04PM +0800, Wu Fengguang wrote:
> On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > > - not a factor at all for updating balanced_rate (whether or not we do (2))
> > >   well, in this concept: the balanced_rate formula inherently does not
> > >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> > >   based on the ratelimit executed for the past 200ms:
> > > 
> > >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> > 
> > Ok, this is where it all goes funny..
> > 
> > So if you want completely separated feedback loops I would expect
> 
> If call it feedback loops, then it's a series of independent feedback
> loops of depth 1.  Because each balanced_rate is a fresh estimation
> dependent solely on
> 
> - writeout bandwidth
> - N, the number of dd tasks
> 
> in the past 200ms.
> 
> As long as a CONSTANT ratelimit (whatever value it is) is executed in
> the past 200ms, we can get the same balanced_rate.
> 
>         balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate
> 
> The resulted balanced_rate is independent of how large the CONSTANT
> ratelimit is, because if we start with a doubled CONSTANT ratelimit,
> we'll see doubled dirty_rate and result in the same balanced_rate. 
> 
> In that manner, balance_rate_(i+1) is not really depending on the
> value of balance_rate_(i): whatever balance_rate_(i) is, we are going
> to get the same balance_rate_(i+1) if not considering estimation
> errors. Note that the estimation errors mainly come from the
> fluctuations in dirty_rate.
> 
> That may well be what's already in your mind, just that we disagree
> about the terms ;)
> 
> > something like:
> > 
> > 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> > 
> > The former is a complete feedback loop, expressing the new value in the
> > old value (*) with bw_ratio as feedback parameter; if we throttled too
> > much, the dirty_rate will have dropped and the bw_ratio will be <1
> > causing the balance_rate to drop increasing the dirty_rate, and vice
> > versa.
> 
> In principle, the bw_ratio works that way. However since
> balance_rate_(i) is not the exact _executed_ ratelimit in
> balance_dirty_pages().
> 
> > (*) which is the form I expected and why I thought your primary feedback
> > loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
>  
> Because the executed ratelimit was rate_(i) * pos_ratio.
> 
> > With the above balance_rate is an independent variable that tracks the
> > write bandwidth. Now possibly you'd want a low-pass filter on that since
> > your bw_ratio is a bit funny in the head, but that's another story.
> 
> Yeah.
> 
> > Then when you use the balance_rate to actually throttle tasks you apply
> > your secondary control steering the dirty page count, yielding:
> > 
> > 	task_rate = balance_rate * pos_ratio
> 
> Right. Note the above formula is not a derived one, but an original
> one that later leads to pos_ratio showing up in the calculation of
> balanced_rate.
> 
> > >   and task_ratelimit_200ms happen to can be estimated from
> > > 
> > >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> > 
> > >   We may alternatively record every task_ratelimit executed in the
> > >   past 200ms and average them all to get task_ratelimit_200ms. In this
> > >   way we take the "superfluous" pos_ratio out of sight :) 
> > 
> > Right, so I'm not at all sure that makes sense, its not immediately
> > evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> > clear to me why your primary feedback loop uses task_ratelimit_200ms at
> > all. 
> 
> task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
> by balance_dirty_pages(). So this is an original formula:
> 
>         task_ratelimit = balance_rate * pos_ratio
> 
> task_ratelimit_200ms is also used as an original data source in
> 
>         balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
> 

I think the above works out to:

 task_ratelimit = balanced_rate * pos_ratio
or
 task_ratelimit = task_ratelimit_200ms * write_bw / dirty_rate * pos_ratio
or
 task_ratelimit = balance_rate * pos_ratio  * write_bw / dirty_rate * pos_ratio
or
 task_ratelimit = balance_rate * write_bw / dirty_rate * (pos_ratio)^2

And the question is, why not:

 task_ratelimit = prev-balance_rate * write_bw / dirty_rate * pos_ratio

which sounds more intuitive than the former.
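
The expansion above is plain substitution and can be checked with
arbitrary numbers. This is a minimal sketch; the three function names
are hypothetical and the inputs are made-up values, with no assumption
about how dirty_rate itself was produced.

```python
def task_ratelimit_expanded(balance_rate, pos_ratio, write_bw, dirty_rate):
    # Follow the substitutions step by step.
    task_ratelimit_200ms = balance_rate * pos_ratio
    balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate
    return balanced_rate * pos_ratio


def task_ratelimit_squared(balance_rate, pos_ratio, write_bw, dirty_rate):
    # The collapsed form with pos_ratio squared.
    return balance_rate * write_bw / dirty_rate * pos_ratio ** 2


def task_ratelimit_alternative(balance_rate, pos_ratio, write_bw, dirty_rate):
    # The proposed alternative with a single pos_ratio factor.
    return balance_rate * write_bw / dirty_rate * pos_ratio


args = dict(balance_rate=20.0, pos_ratio=0.5, write_bw=80.0, dirty_rate=40.0)
print(task_ratelimit_expanded(**args), task_ratelimit_squared(**args))
print(task_ratelimit_alternative(**args))
```

The first two values agree, and the alternative differs from them by
exactly one factor of pos_ratio.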

You somehow directly jump to  

	balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

without explaining why the following will not work.

	balanced_rate_(i+1) = balance_rate(i) * write_bw / dirty_rate

Thanks
Vivek

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 10:01                     ` Peter Zijlstra
@ 2011-08-23 14:36                       ` Vivek Goyal
  -1 siblings, 0 replies; 169+ messages in thread
From: Vivek Goyal @ 2011-08-23 14:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 12:01:00PM +0200, Peter Zijlstra wrote:
> On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > - not a factor at all for updating balanced_rate (whether or not we do (2))
> >   well, in this concept: the balanced_rate formula inherently does not
> >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> >   based on the ratelimit executed for the past 200ms:
> > 
> >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> 
> Ok, this is where it all goes funny..

Exactly. This is where it gets confusing and is the bone of contention.

> 
> So if you want completely separated feedback loops I would expect
> something like:
> 
> 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> 

I agree. This makes sense. IOW:

bdi->dirty_ratelimit_n = bdi->dirty_ratelimit_(n-1) * write_bw / dirty_rate

> The former is a complete feedback loop, expressing the new value in the
> old value (*) with bw_ratio as feedback parameter; if we throttled too
> much, the dirty_rate will have dropped and the bw_ratio will be <1
> causing the balance_rate to drop increasing the dirty_rate, and vice
> versa.

I think you meant:

"if we throttled too much, the dirty_rate will have dropped and the bw_ratio
 will be >1 causing the balance_rate to increase hence increasing the
 dirty_rate, and vice versa."
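
A small sketch of the direction of that feedback, assuming bw_ratio means
write_bw / dirty_rate as it is used in this thread. The helper names are
made up for illustration.

```c
/* bw_ratio as used in this thread: write_bw / dirty_rate. */
double bw_ratio(double write_bw, double dirty_rate)
{
    return write_bw / dirty_rate;
}

/* One step of the loop Peter describes: the new rate is expressed
 * in terms of the old rate, with bw_ratio as the feedback term. */
double next_balance_rate(double balance_rate,
                         double write_bw, double dirty_rate)
{
    return balance_rate * bw_ratio(write_bw, dirty_rate);
}
```

If tasks were throttled too hard, dirty_rate falls below write_bw, the
ratio is above 1 and the next rate goes up. If they were throttled too
little, the ratio is below 1 and the next rate comes down.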

> 
> (*) which is the form I expected and why I thought your primary feedback
> loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio
> 
> With the above balance_rate is an independent variable that tracks the
> write bandwidth. Now possibly you'd want a low-pass filter on that since
> your bw_ratio is a bit funny in the head, but that's another story.
> 
> Then when you use the balance_rate to actually throttle tasks you apply
> your secondary control steering the dirty page count, yielding:
> 
> 	task_rate = balance_rate * pos_ratio
> 
> >   and task_ratelimit_200ms happen to can be estimated from
> > 
> >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> 
> >   We may alternatively record every task_ratelimit executed in the
> >   past 200ms and average them all to get task_ratelimit_200ms. In this
> >   way we take the "superfluous" pos_ratio out of sight :) 
> 
> Right, so I'm not at all sure that makes sense, its not immediately
> evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> clear to me why your primary feedback loop uses task_ratelimit_200ms at
> all. 
> 

Well, I thought this much was evident:

task_ratelimit = balanced_rate * pos_ratio

What is not evident to me is the following:

balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio

Instead, like you, I also thought that the following is more obvious:

balanced_rate_(i+1) = balanced_rate_(i) * bw_ratio

Thanks
Vivek

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23 10:01                     ` Peter Zijlstra
@ 2011-08-23 14:15                       ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-23 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 23, 2011 at 06:01:00PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> > - not a factor at all for updating balanced_rate (whether or not we do (2))
> >   well, in this concept: the balanced_rate formula inherently does not
> >   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
> >   based on the ratelimit executed for the past 200ms:
> > 
> >           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio
> 
> Ok, this is where it all goes funny..
> 
> So if you want completely separated feedback loops I would expect

If we call them feedback loops, then it's a series of independent feedback
loops of depth 1, because each balanced_rate is a fresh estimate that
depends solely on

- writeout bandwidth
- N, the number of dd tasks

in the past 200ms.

As long as a CONSTANT ratelimit (whatever value it is) is executed in
the past 200ms, we can get the same balanced_rate.

        balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate

The resulting balanced_rate is independent of how large the CONSTANT
ratelimit is: if we start with a doubled CONSTANT ratelimit, we'll see
a doubled dirty_rate and end up with the same balanced_rate.

In that manner, balance_rate_(i+1) does not really depend on the
value of balance_rate_(i): whatever balance_rate_(i) is, we are going
to get the same balance_rate_(i+1) if we ignore estimation
errors. Note that the estimation errors mainly come from the
fluctuations in dirty_rate.
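
The argument can be checked with a toy model. The sketch below assumes N
identical dd-like tasks that each dirty pages exactly at the ratelimit
they were given, so dirty_rate = N * ratelimit. That model and the
function names are assumptions for illustration, not kernel code.

```c
/* Assumed model: n_tasks identical writers, each dirtying pages
 * exactly at the ratelimit it was executed with. */
double observed_dirty_rate(unsigned int n_tasks, double executed_ratelimit)
{
    return n_tasks * executed_ratelimit;
}

/* balanced_rate = CONSTANT_ratelimit * write_bw / dirty_rate */
double estimate_balanced_rate(double executed_ratelimit,
                              double write_bw, double dirty_rate)
{
    return executed_ratelimit * write_bw / dirty_rate;
}
```

Under this model the executed ratelimit cancels out and the estimate is
always write_bw / N, whether the period ran at 10 or at 20.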

That may well be what's already in your mind, just that we disagree
about the terms ;)

> something like:
> 
> 	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms
> 
> The former is a complete feedback loop, expressing the new value in the
> old value (*) with bw_ratio as feedback parameter; if we throttled too
> much, the dirty_rate will have dropped and the bw_ratio will be <1
> causing the balance_rate to drop increasing the dirty_rate, and vice
> versa.

In principle, the bw_ratio works that way. However,
balance_rate_(i) is not the exact _executed_ ratelimit in
balance_dirty_pages().

> (*) which is the form I expected and why I thought your primary feedback
> loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio

Because the executed ratelimit was rate_(i) * pos_ratio.

> With the above balance_rate is an independent variable that tracks the
> write bandwidth. Now possibly you'd want a low-pass filter on that since
> your bw_ratio is a bit funny in the head, but that's another story.

Yeah.

> Then when you use the balance_rate to actually throttle tasks you apply
> your secondary control steering the dirty page count, yielding:
> 
> 	task_rate = balance_rate * pos_ratio

Right. Note the above formula is not a derived one, but an original
one that later leads to pos_ratio showing up in the calculation of
balanced_rate.

> >   and task_ratelimit_200ms happen to can be estimated from
> > 
> >           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio
> 
> >   We may alternatively record every task_ratelimit executed in the
> >   past 200ms and average them all to get task_ratelimit_200ms. In this
> >   way we take the "superfluous" pos_ratio out of sight :) 
> 
> Right, so I'm not at all sure that makes sense, its not immediately
> evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
> clear to me why your primary feedback loop uses task_ratelimit_200ms at
> all. 

task_ratelimit is used and hence defined to be (balance_rate * pos_ratio)
by balance_dirty_pages(). So this is an original formula:

        task_ratelimit = balance_rate * pos_ratio

task_ratelimit_200ms is also used as an original data source in

        balanced_rate = task_ratelimit_200ms * write_bw / dirty_rate

Then we try to estimate task_ratelimit_200ms by assuming all tasks
have been executing the same CONSTANT ratelimit in
balance_dirty_pages(). Hence we get

        task_ratelimit_200ms ~= prev_balance_rate * pos_ratio

> >   There is fundamentally no dependency between balanced_rate_(i+1) and
> >   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
> >   only asks for _whatever_ CONSTANT task ratelimit to be executed for
> >   200ms, then it get the balanced rate from the dirty_rate feedback.
> 
> How can there not be a relation between balance_rate_(i+1) and
> balance_rate_(i) ? 

In this manner: even though balance_rate_(i) is somehow used for
calculating balance_rate_(i+1), the latter will evaluate to the same
value whatever balance_rate_(i) is.

That is, there are two dependencies: the seeming dependency in the
formula, and the effective dependency in the data values.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-23  3:40                   ` Wu Fengguang
  (?)
@ 2011-08-23 10:01                     ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-23 10:01 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-23 at 11:40 +0800, Wu Fengguang wrote:
> - not a factor at all for updating balanced_rate (whether or not we do (2))
>   well, in this concept: the balanced_rate formula inherently does not
>   derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
>   based on the ratelimit executed for the past 200ms:
> 
>           balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio

Ok, this is where it all goes funny..

So if you want completely separated feedback loops I would expect
something like:

	balance_rate_(i+1) = balance_rate_(i) * bw_ratio   ; every 200ms

The former is a complete feedback loop, expressing the new value in the
old value (*) with bw_ratio as feedback parameter; if we throttled too
much, the dirty_rate will have dropped and the bw_ratio will be <1
causing the balance_rate to drop increasing the dirty_rate, and vice
versa.

(*) which is the form I expected and why I thought your primary feedback
loop looked like: rate_(i+1) = rate_(i) * pos_ratio * bw_ratio

With the above balance_rate is an independent variable that tracks the
write bandwidth. Now possibly you'd want a low-pass filter on that since
your bw_ratio is a bit funny in the head, but that's another story.

Then when you use the balance_rate to actually throttle tasks you apply
your secondary control steering the dirty page count, yielding:

	task_rate = balance_rate * pos_ratio

>   and task_ratelimit_200ms happen to can be estimated from
> 
>           task_ratelimit_200ms ~= balanced_rate_i * pos_ratio

>   We may alternatively record every task_ratelimit executed in the
>   past 200ms and average them all to get task_ratelimit_200ms. In this
>   way we take the "superfluous" pos_ratio out of sight :) 

Right, so I'm not at all sure that makes sense, its not immediately
evident that <task_ratelimit> ~= balance_rate * pos_ratio. Nor is it
clear to me why your primary feedback loop uses task_ratelimit_200ms at
all. 

>   There is fundamentally no dependency between balanced_rate_(i+1) and
>   balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
>   only asks for _whatever_ CONSTANT task ratelimit to be executed for
>   200ms, then it get the balanced rate from the dirty_rate feedback.

How can there not be a relation between balance_rate_(i+1) and
balance_rate_(i) ? 

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-22 15:38                 ` Peter Zijlstra
@ 2011-08-23  3:40                   ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-23  3:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 22, 2011 at 11:38:07PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote:
> > On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> > > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> > 
> > To start with,
> > 
> >                                                 write_bw
> >         ref_bw = task_ratelimit_in_past_200ms * --------
> >                                                 dirty_bw
> > 
> > where
> >         task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
> > 
> > > > Now all of the above would seem to suggest:
> > > > 
> > > >   dirty_ratelimit := ref_bw
> > 
> > Right, ideally ref_bw is the balanced dirty ratelimit. I actually
> > started with exactly the above equation when I got choked by pure
> > pos_bw based feedback control (as mentioned in the reply to Jan's
> > email) and introduced the ref_bw estimation as the way out.
> > 
> > But there are some imperfections in ref_bw, too. Which makes it not
> > suitable for direct use:
> > 
> > 1) large fluctuations
> 
> OK, understood.
> 
> > 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw)
> > becomes unbalanced match, which leads to large systematical errors
> > in ref_bw. The truncates, due to its possibly bumpy nature, can hardly
> > be compensated smoothly.
> 
> OK.
> 
> > 3) since we ultimately want to
> > 
> > - keep the dirty pages around the setpoint as long time as possible
> > - keep the fluctuations of task ratelimit as small as possible
> 
> Fair enough ;-)
> 
> > the update policy used for (2) also serves the above goals nicely:
> > if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
> > and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
> > point to bring up dirty_ratelimit in a hurry and to hurt both the
> > above two goals.
> 
> Right, so still I feel somewhat befuddled, so we have:
> 
> 	dirty_ratelimit - rate at which we throttle dirtiers as
> 			  estimated upto 200ms ago.

Note that bdi->dirty_ratelimit is supposed to be the balanced
ratelimit, i.e. (write_bw / N), regardless of whether the dirty pages
meet the setpoint.

In _concept_, the bdi balanced ratelimit is updated _independent_ of
the position control embodied in the task ratelimit calculation.

A lot of the confusion seems to come from the seemingly intertwined rate
and position controls. However, in my mind there are two levels of
relationship:

1) they work fundamentally independently of each other, each trying to
   fulfill one single target (either balanced rate or balanced position)

2) _based_ on (1), and completely optional, try to constrain the rate update
   to get a more stable ->dirty_ratelimit and a more balanced dirty position

Note that (2) is not a must even if there are systematic errors in
the balanced_rate calculation. For example, the v8 patchset only does (1)
and hence simply does

        bdi->dirty_ratelimit = balanced_rate;

And it can still balance at some point (though not exactly around the setpoint):

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G-bs=1M/ext4-1dd-1M-8p-2942M-20:10-3.0.0-next-20110802+-2011-08-08.19:47/balance_dirty_pages-pages.png

This holds even though ext4 has a mismatched ratio (dirty_rate:write_bw
~= 3:2), which introduces systematic errors in balanced_rate:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/3G-bs=1M/ext4-1dd-1M-8p-2942M-20:10-3.0.0-next-20110802+-2011-08-08.19:47/global_dirtied_written.png

> 	pos_ratio	- ratio adjusting the dirty_ratelimit
> 			  for variance in dirty pages around its target

So pos_ratio is

- a _limiting_ factor rather than an _adjusting_ factor for
  updating ->dirty_ratelimit (when doing (2))

- not a factor at all for updating balanced_rate (whether or not we do (2))
  well, in this concept: the balanced_rate formula inherently does not
  derive the balanced_rate_(i+1) from balanced_rate_i. Rather it's
  based on the ratelimit executed for the past 200ms:

          balanced_rate_(i+1) = task_ratelimit_200ms * bw_ratio

  and task_ratelimit_200ms happens to be estimable from

          task_ratelimit_200ms ~= balanced_rate_i * pos_ratio

  There is fundamentally no dependency between balanced_rate_(i+1) and
  balanced_rate_i/task_ratelimit_200ms: the balanced_rate estimation
  only asks for _whatever_ CONSTANT task ratelimit to be executed for
  200ms, then it gets the balanced rate from the dirty_rate feedback.

  We may alternatively record every task_ratelimit executed in the
  past 200ms and average them all to get task_ratelimit_200ms. In this
  way we take the "superfluous" pos_ratio out of sight :)
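
  A rough sketch of that alternative, assuming a simple unweighted mean
  over the samples seen in one 200ms period. The structure and function
  names are hypothetical and not part of any posted patch.

```c
/* Hypothetical accumulator, not a kernel structure. */
struct ratelimit_avg {
    unsigned long sum;    /* sum of task_ratelimit samples this period */
    unsigned long count;  /* number of samples this period */
};

/* Record one task_ratelimit value as it is executed. */
void ratelimit_avg_record(struct ratelimit_avg *avg,
                          unsigned long task_ratelimit)
{
    avg->sum += task_ratelimit;
    avg->count++;
}

/* Called once per period: return the mean of the recorded samples and
 * reset the accumulator. Returns `fallback` if nothing was recorded. */
unsigned long ratelimit_avg_flush(struct ratelimit_avg *avg,
                                  unsigned long fallback)
{
    unsigned long mean = avg->count ? avg->sum / avg->count : fallback;

    avg->sum = 0;
    avg->count = 0;
    return mean;
}
```

  The flushed mean would stand in for task_ratelimit_200ms, so the
  balanced_rate formula would no longer mention pos_ratio explicitly. A
  time-weighted average may be closer to what is wanted; this sketch does
  not attempt that.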

> 	bw_ratio	- ratio adjusting the dirty_ratelimit
> 			  for variance in input/output bandwidth
> 
> and we need to basically do:
> 
> 	dirty_ratelimit *= pos_ratio * bw_ratio

So there is not even this recursion:

        balanced_rate *= bw_ratio

Each balanced_rate is estimated from the start, based on each 200ms period.

> to update the dirty_ratelimit to reflect the current state. However per
> 1) and 2) bw_ratio is crappy and hard to fix.
> 
> So you propose to update dirty_ratelimit only if both pos_ratio and
> bw_ratio point in the same direction, however that would result in:
> 
>   if (pos_ratio < UNIT && bw_ratio < UNIT ||
>       pos_ratio > UNIT && bw_ratio > UNIT) {
> 	dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT;
> 	dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT;
>   }

We start by doing this for (1):

        dirty_ratelimit = balanced_rate

and then try to refine it for (1)+(2):

        dirty_ratelimit => balanced_rate, but limit the progress by pos_ratio

> > > > However for that you use:
> > > > 
> > > >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> > > >         dirty_ratelimit = max(ref_bw, pos_bw);
> > > > 
> > > >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> > > >         dirty_ratelimit = min(ref_bw, pos_bw);
> > 
> > The above are merely constraints to the dirty_ratelimit update.
> > It serves to
> > 
> > 1) stop adjusting the rate when it's against the position control
> >    target (the adjusted rate will slow down the progress of dirty
> >    pages going back to setpoint).
> 
> Not strictly speaking, suppose pos_ratio = 0.5 and bw_ratio = 1.1, then
> they point in different directions however:
> 
>  0.5 < 1 &&  0.5 * 1.1 < 1
> 
> so your code will in fact update the dirty_ratelimit, even though the
> two factors point in opposite directions.

It does not work that way since pos_ratio does not take part in the
multiplication. However, I admit that the tests

        (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
        (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)

don't aim to avoid all unnecessary updates, and they may even stop some
rightful updates. It's simply not possible to act perfectly. It's merely
a rule that sounds "reasonable" in theory and works reasonably well in
practice :) I'd be happy to try others if there are better ones.

> > 2) limit the step size. pos_bw is changing values step by step,
> >    leaving a consistent trace comparing to the randomly jumping
> >    ref_bw. pos_bw also has smaller errors in stable state and normally
> >    have larger errors when there are big errors in rate. So it's a
> >    pretty good limiting factor for the step size of dirty_ratelimit.
> 
> OK, so that's the min/max stuff, however it only works because you use
> pos_bw and ref_bw instead of the fully separated factors.

Yes, the min/max stuff is for limiting the step size. The "limiting"
intention can be made clearer if written as

        delta = balanced_rate - base_rate;

        if (delta > pos_rate - base_rate)
            delta = pos_rate - base_rate;

        delta /= 8;
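
The step limiting above could be sketched for both directions as follows.
This is an illustration under assumptions, not the patch's code: the function
name is hypothetical, base_rate is assumed to stand for the current
dirty_ratelimit, and the downward branch is simply mirrored from the upward
one shown above.

```c
/*
 * Hypothetical sketch: return the step to add to base_rate.
 * The step moves toward balanced_rate, never goes past pos_rate,
 * and covers only 1/8 of the allowed distance per update.
 */
static long limited_step(long base_rate, long balanced_rate, long pos_rate)
{
	long delta = balanced_rate - base_rate;	/* where rate control wants to go */
	long bound = pos_rate - base_rate;	/* how far position control allows */

	/* No step unless both rates lie on the same side of base_rate. */
	if ((delta > 0 && bound <= 0) || (delta < 0 && bound >= 0))
		return 0;

	/* Clamp the step so that it never goes past pos_rate. */
	if (delta > 0 && delta > bound)
		delta = bound;
	else if (delta < 0 && delta < bound)
		delta = bound;

	return delta / 8;
}
```

With base_rate = 100, balanced_rate = 180 and pos_rate = 140, the raw delta
of 80 is clamped to 40 and the resulting step is 5. With pos_rate = 80
instead, the two rates disagree about the direction and the step is 0.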

> > Hope the above elaboration helps :)
> 
> A little.. 

And now? ;)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 14:20               ` Wu Fengguang
  (?)
@ 2011-08-22 15:38                 ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-22 15:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 22:20 +0800, Wu Fengguang wrote:
> On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> 
> To start with,
> 
>                                                 write_bw
>         ref_bw = task_ratelimit_in_past_200ms * --------
>                                                 dirty_bw
> 
> where
>         task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
> 
> > > Now all of the above would seem to suggest:
> > > 
> > >   dirty_ratelimit := ref_bw
> 
> Right, ideally ref_bw is the balanced dirty ratelimit. I actually
> started with exactly the above equation when I got choked by pure
> pos_bw based feedback control (as mentioned in the reply to Jan's
> email) and introduced the ref_bw estimation as the way out.
> 
> But there are some imperfections in ref_bw, too. Which makes it not
> suitable for direct use:
> 
> 1) large fluctuations

OK, understood.

> 2) due to truncates and fs redirties, the (write_bw <=> dirty_bw)
> becomes unbalanced match, which leads to large systematical errors
> in ref_bw. The truncates, due to its possibly bumpy nature, can hardly
> be compensated smoothly.

OK.

> 3) since we ultimately want to
> 
> - keep the dirty pages around the setpoint as long time as possible
> - keep the fluctuations of task ratelimit as small as possible

Fair enough ;-)

> the update policy used for (2) also serves the above goals nicely:
> if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
> and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
> point to bring up dirty_ratelimit in a hurry and to hurt both the
> above two goals.

Right, so I still feel somewhat befuddled. So we have:

	dirty_ratelimit - rate at which we throttle dirtiers as
			  estimated up to 200ms ago.

	pos_ratio	- ratio adjusting the dirty_ratelimit
			  for variance in dirty pages around its target

	bw_ratio	- ratio adjusting the dirty_ratelimit
			  for variance in input/output bandwidth

and we need to basically do:

	dirty_ratelimit *= pos_ratio * bw_ratio

to update the dirty_ratelimit to reflect the current state. However per
1) and 2) bw_ratio is crappy and hard to fix.

So you propose to update dirty_ratelimit only if both pos_ratio and
bw_ratio point in the same direction, however that would result in:

  if (pos_ratio < UNIT && bw_ratio < UNIT ||
      pos_ratio > UNIT && bw_ratio > UNIT) {
	dirty_ratelimit = (dirty_ratelimit * pos_ratio) / UNIT;
	dirty_ratelimit = (dirty_ratelimit * bw_ratio) / UNIT;
  }

> > > However for that you use:
> > > 
> > >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> > >         dirty_ratelimit = max(ref_bw, pos_bw);
> > > 
> > >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> > >         dirty_ratelimit = min(ref_bw, pos_bw);
> 
> The above are merely constraints to the dirty_ratelimit update.
> It serves to
> 
> 1) stop adjusting the rate when it's against the position control
>    target (the adjusted rate will slow down the progress of dirty
>    pages going back to setpoint).

Not strictly speaking. Suppose pos_ratio = 0.5 and bw_ratio = 1.1; then
they point in different directions, however:

 0.5 < 1 &&  0.5 * 1.1 < 1

so your code will in fact update the dirty_ratelimit, even though the
two factors point in opposite directions.
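
To make the arithmetic concrete, here is a minimal C sketch that applies the
two quoted tests to rates derived from separated ratios. Several things are
assumptions made only for illustration: the fixed-point scale UNIT = 1024,
the function names, pos_bw = dirty_ratelimit * pos_ratio, and
ref_bw = pos_bw * bw_ratio (from the quoted ref_bw formula, with bw_ratio
taken as write_bw / dirty_bw).

```c
#define UNIT 1024	/* hypothetical fixed-point scale: 1.0 == UNIT */

/* The two quoted tests, returning the (possibly updated) dirty_ratelimit. */
static unsigned long constrain_update(unsigned long dirty_ratelimit,
				      unsigned long pos_bw,
				      unsigned long ref_bw)
{
	if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
		return pos_bw > ref_bw ? pos_bw : ref_bw;	/* max() */
	if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
		return pos_bw < ref_bw ? pos_bw : ref_bw;	/* min() */
	return dirty_ratelimit;					/* no update */
}

/* Derive pos_bw and ref_bw from the separated ratios, then apply the tests. */
static unsigned long update_from_ratios(unsigned long dirty_ratelimit,
					unsigned long pos_ratio,
					unsigned long bw_ratio)
{
	unsigned long pos_bw = dirty_ratelimit * pos_ratio / UNIT;
	unsigned long ref_bw = pos_bw * bw_ratio / UNIT;

	return constrain_update(dirty_ratelimit, pos_bw, ref_bw);
}
```

Under these assumptions, dirty_ratelimit = 1000 with pos_ratio = 512 (0.5)
and bw_ratio = 1126 (about 1.1) gives pos_bw = 500 and ref_bw = 549. Both are
below 1000, so the first test fires and the ratelimit becomes 549, although
bw_ratio on its own is above 1.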

> 2) limit the step size. pos_bw is changing values step by step,
>    leaving a consistent trace comparing to the randomly jumping
>    ref_bw. pos_bw also has smaller errors in stable state and normally
>    have larger errors when there are big errors in rate. So it's a
>    pretty good limiting factor for the step size of dirty_ratelimit.

OK, so that's the min/max stuff, however it only works because you use
pos_bw and ref_bw instead of the fully separated factors.

> Hope the above elaboration helps :)

A little.. 

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-19  2:53     ` Vivek Goyal
@ 2011-08-19  3:25       ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-19  3:25 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 19, 2011 at 10:53:21AM +0800, Vivek Goyal wrote:
> On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:
> 
> [..]
> > +/*
> > + * Dirty position control.
> > + *
> > + * (o) global/bdi setpoints
> > + *
> > + * We want the dirty pages be balanced around the global/bdi setpoints.
> > + * When the number of dirty pages is higher/lower than the setpoint, the
> > + * dirty position control ratio (and hence task dirty ratelimit) will be
> > + * decreased/increased to bring the dirty pages back to the setpoint.
> > + *
> > + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> > + *
> > + *     if (dirty < setpoint) scale up   pos_ratio
> > + *     if (dirty > setpoint) scale down pos_ratio
> > + *
> > + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> > + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> > + *
> > + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> > + *
> > + * (o) global control line
> > + *
> > + *     ^ pos_ratio
> > + *     |
> > + *     |            |<===== global dirty control scope ======>|
> > + * 2.0 .............*
> > + *     |            .*
> > + *     |            . *
> > + *     |            .   *
> > + *     |            .     *
> > + *     |            .        *
> > + *     |            .            *
> > + * 1.0 ................................*
> > + *     |            .                  .     *
> > + *     |            .                  .          *
> > + *     |            .                  .              *
> > + *     |            .                  .                 *
> > + *     |            .                  .                    *
> > + *   0 +------------.------------------.----------------------*------------->
> > + *           freerun^          setpoint^                 limit^   dirty pages
> > + *
> > + * (o) bdi control lines
> > + *
> > + * The control lines for the global/bdi setpoints both stretch up to @limit.
> > + * The below figure illustrates the main bdi control line with an auxiliary
> > + * line extending it to @limit.
> > + *
> > + *   o
> > + *     o
> > + *       o                                      [o] main control line
> > + *         o                                    [*] auxiliary control line
> > + *           o
> > + *             o
> > + *               o
> > + *                 o
> > + *                   o
> > + *                     o
> > + *                       o--------------------- balance point, rate scale = 1
> > + *                       | o
> > + *                       |   o
> > + *                       |     o
> > + *                       |       o
> > + *                       |         o
> > + *                       |           o
> > + *                       |             o------- connect point, rate scale = 1/2
> > + *                       |               .*
> > + *                       |                 .   *
> > + *                       |                   .      *
> > + *                       |                     .         *
> > + *                       |                       .           *
> > + *                       |                         .              *
> > + *                       |                           .                 *
> > + *  [--------------------+-----------------------------.--------------------*]
> > + *  0                 setpoint                     x_intercept           limit
> > + *
> > + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> > + * normal if it starts high in situations like
> > + * - start writing to a slow SD card and a fast disk at the same time. The SD
> > + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> > + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> > + */
> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +					unsigned long thresh,
> > +					unsigned long bg_thresh,
> > +					unsigned long dirty,
> > +					unsigned long bdi_thresh,
> > +					unsigned long bdi_dirty)
> > +{
> > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > +	unsigned long limit = hard_dirty_limit(thresh);
> > +	unsigned long x_intercept;
> > +	unsigned long setpoint;		/* the target balance point */
> > +	unsigned long span;
> > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > +	long x;
> > +
> > +	if (unlikely(dirty >= limit))
> > +		return 0;
> > +
> > +	/*
> > +	 * global setpoint
> > +	 *
> > +	 *                         setpoint - dirty 3
> > +	 *        f(dirty) := 1 + (----------------)
> > +	 *                         limit - setpoint
> > +	 *
> > +	 * it's a 3rd order polynomial that subjects to
> > +	 *
> > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > +	 * (2) f(setpoint) = 1.0 => the balance point
> > +	 * (3) f(limit)    = 0   => the hard limit
> > +	 * (4) df/dx       < 0	 => negative feedback control
> > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > +	 *     => fast response on large errors; small oscillation near setpoint
> > +	 */
> > +	setpoint = (freerun + limit) / 2;
> > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > +		    limit - setpoint + 1);
> > +	pos_ratio = x;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > +
> > +	/*
> > +	 * bdi setpoint
> > +	 *
> > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> > +	 *
> > +	 * The main bdi control line is a linear function that subjects to
> > +	 *
> > +	 * (1) f(setpoint) = 1.0
> > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > +	 *
> > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > +	 * regularly within range
> > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > +	 * fluctuation range for pos_ratio.
> > +	 *
> > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > +	 * own size, so move the slope over accordingly.
> > +	 */
> > +	if (unlikely(bdi_thresh > thresh))
> > +		bdi_thresh = thresh;
> > +	/*
> > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > +	 */
> > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > +	setpoint = setpoint * (u64)x >> 16;
> > +	/*
> > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > +	 */
> > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > +		       thresh + 1);
> > +	x_intercept = setpoint + 2 * span;
> > +
> 
> Hi Fengguang,
> 
> A few very basic queries.
> 
> - Why can't we use the same formula for the bdi position ratio as for the
>   global position ratio? Are you not looking for similar properties, i.e.
>   little variation near the setpoint and faster throttling away from
>   the setpoint?

The changelog has more details; however, I hope this rephrased summary
answers the question better.

Firstly, in the single bdi case, the bdi and global formulas complement
each other: the bdi's slope is proportional to the writeout bandwidth,
while the global one scales with memory size. On a huge memory system,
the global position feedback becomes very weak (even far away from the
setpoint). This is where the bdi control line can help pull the dirty
pages back to the setpoint.
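
To make the first point concrete, here is a rough sketch (not code from
the patch) that evaluates both control lines in floating point for the
same deviation above the setpoint. The helper names global_f() and
bdi_f() and all sample numbers are made up for illustration; the
formulas are the ones from the comments quoted above, for the single
bdi case.

```c
#include <assert.h>

/*
 * Global control line from the quoted comment:
 *   f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3
 * Hypothetical helper; the patch does this in fixed point.
 */
static double global_f(double freerun, double limit, double dirty)
{
	double setpoint = (freerun + limit) / 2;
	double x;

	assert(limit > freerun);
	x = (setpoint - dirty) / (limit - setpoint);
	return 1.0 + x * x * x;
}

/*
 * Main bdi control line, single bdi case:
 *   f(bdi_dirty) = 1 - (bdi_dirty - bdi_setpoint) / (8 * write_bw)
 * write_bw is in pages per second, read as one second's worth of pages.
 */
static double bdi_f(double bdi_setpoint, double write_bw, double bdi_dirty)
{
	assert(write_bw > 0);
	return 1.0 - (bdi_dirty - bdi_setpoint) / (8 * write_bw);
}
```

With freerun = 1000000 and limit = 3000000 pages (so setpoint = 2000000)
and write_bw = 25000 pages/s, being 25000 pages over the setpoint leaves
global_f() above 0.9999, while bdi_f() already drops to 0.875.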

Secondly, in the JBOD case, the global and bdi dirty thresholds are
fundamentally different. The global one is a stable, strong limit,
while the bdi one fluctuates and hence is only suitable as a weak
limit. The other reason to make it a weak limit is that there are
valid situations where (bdi_dirty >> bdi_thresh), and there it is
desirable to throttle the dirtier at a reasonably small rate rather
than to hard throttle it.

> - In the bdi calculation, setpoint seems to be in number of pages and
>   limit (x_intercept) seems to be a combination of nr pages + pages/sec.
>   Why is it different from the global setpoint and limit? I mean, could
>   this not have been like the global calculation, where we try to keep
>   bdi_dirty close to bdi_thresh and calculate pos_ratio?

Because the bdi dirty pages are observed to typically fluctuate by up to
one second's worth of data. So the write_bw used here is really
(1s * write_bw).

> - In the global pos_ratio calculation the terminology used is "limit",
>   while the same thing seems to be referred to as x_intercept in the bdi
>   position ratio calculation.

Yes. Because the bdi control lines don't intend to act as a hard limit at all.

It's actually possible for x_intercept to become larger than the global limit.
This means it's a memory-tight system (or the storage is super fast),
where the bdi dirty pages will inevitably fluctuate a lot (up to write_bw).
We just let go of them and let the global formula take control.
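
For illustration, the x_intercept arithmetic quoted above can be replayed
in user space. This is only a sketch: bdi_x_intercept() is a made-up
helper, div_u64() is replaced by plain 64-bit division, and the sample
numbers are invented. It shows how, in the single bdi case, span comes
out at about 4 * write_bw (pages written in four seconds), so x_intercept
lands about 8 * write_bw above the bdi setpoint and can exceed thresh.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space replay of the quoted patch arithmetic.
 * All arguments are page counts; write_bw is pages written per second,
 * read as "one second's worth of pages".
 */
static uint64_t bdi_x_intercept(uint64_t thresh, uint64_t bdi_thresh,
				uint64_t setpoint, uint64_t write_bw)
{
	uint64_t x, span;

	if (bdi_thresh > thresh)
		bdi_thresh = thresh;

	/* scale global setpoint to bdi's: setpoint *= bdi_thresh / thresh */
	x = (bdi_thresh << 16) / (thresh | 1);
	setpoint = setpoint * x >> 16;

	/* ~4 * write_bw for a single bdi, moving towards bdi_thresh in JBOD */
	span = (bdi_thresh * (thresh - bdi_thresh) +
		4 * write_bw * bdi_thresh) / (thresh + 1);

	return setpoint + 2 * span;
}
```

For thresh = bdi_thresh = 100000, setpoint = 75000 and write_bw = 25000,
this returns roughly 275000 pages, far beyond thresh, which is the
memory-tight (or very fast storage) situation described above.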

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-19  2:53     ` Vivek Goyal
  -1 siblings, 0 replies; 169+ messages in thread
From: Vivek Goyal @ 2011-08-19  2:53 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 16, 2011 at 10:20:08AM +0800, Wu Fengguang wrote:

[..]
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0                 setpoint                     x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
> +	x_intercept = setpoint + 2 * span;
> +

Hi Fengguang,

A few very basic queries.

- Why can't we use the same formula for the bdi position ratio as for the
  global position ratio? Are you not looking for similar properties, i.e.
  little variation near the setpoint and faster throttling away from
  the setpoint?

- In the bdi calculation, setpoint seems to be in number of pages and
  limit (x_intercept) seems to be a combination of nr pages + pages/sec.
  Why is it different from the global setpoint and limit? I mean, could
  this not have been like the global calculation, where we try to keep
  bdi_dirty close to bdi_thresh and calculate pos_ratio?

- In the global pos_ratio calculation the terminology used is "limit",
  while the same thing seems to be referred to as x_intercept in the bdi
  position ratio calculation.

Am I missing something very basic here?

Thanks
Vivek

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-18  4:18           ` Wu Fengguang
@ 2011-08-18 19:16             ` Jan Kara
  -1 siblings, 0 replies; 169+ messages in thread
From: Jan Kara @ 2011-08-18 19:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Thu 18-08-11 12:18:01, Wu Fengguang wrote:
> > > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > > +	 */
> > > > > +	setpoint = (freerun + limit) / 2;
> > > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > > +		    limit - setpoint + 1);
> > > > > +	pos_ratio = x;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > > +
> > > > > +	/*
> > > > > +	 * bdi setpoint
> >   OK, so if I understand the code right, we now have basic pos_ratio based
> > on global situation. Now, in the following code, we might scale pos_ratio
> > further down, if bdi_dirty is too much over bdi's share, right?
> 
> Right.
> 
> > Do we also want to scale pos_ratio up, if we are under bdi's share?
> 
> Yes.
> 
> > If yes, do we really want to do it even if global pos_ratio < 1
> > (i.e. we are over global setpoint)?
> 
> Yes. It's safe because the bdi pos_ratio scale is linear and the
> global pos_ratio scale will quickly drop to 0 near @limit, thus
> counter-acting any > 1 bdi pos_ratio.
  OK. I just wanted to make sure I understand it right :-). I can see
arguments for all the different choices so let's see how it works in
practice...
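
A small user-space sketch may help to see why this is safe. It replays
the quoted global cubic in fixed point; global_pos_ratio() is a
hypothetical helper, CALC_SHIFT = 10 is an assumed stand-in for
RATELIMIT_CALC_SHIFT, and division is used instead of the kernel's
shifts, so rounding may differ slightly from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define CALC_SHIFT 10	/* assumed stand-in for RATELIMIT_CALC_SHIFT */

/*
 * Global control line in fixed point (1.0 == 1 << CALC_SHIFT):
 *   f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3
 * Hypothetical helper; all arguments are page counts.
 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	int64_t one = 1 << CALC_SHIFT;
	int64_t setpoint = (freerun + limit) / 2;
	int64_t x, pos_ratio;

	assert(limit > freerun);
	if (dirty >= limit)
		return 0;

	/* x is negative above the setpoint, positive below it */
	x = (setpoint - dirty) * one / (limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x / one;
	pos_ratio = pos_ratio * x / one;
	pos_ratio += one;

	return pos_ratio;
}
```

With freerun = 1000 and limit = 3000, the result is 1024 (1.0) at the
setpoint, close to 2048 (2.0) at freerun, and only a few units just below
limit, so even a bdi scale somewhat above 1 leaves the product near 0
there.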

> > > > > +	 *
> > > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> >                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> > bdi_setpoint to distinguish clearly from the global value.
> 
> OK. I'll add a new variable bdi_setpoint, too, to make it consistent
> everywhere.
> 
> > > > > +	 *
> > > > > +	 * The main bdi control line is a linear function that subjects to
> > > > > +	 *
> > > > > +	 * (1) f(setpoint) = 1.0
> > > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > > +	 *
> > > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > > +	 * regularly within range
> > > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > > +	 * fluctuation range for pos_ratio.
> > > > > +	 *
> > > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > > +	 * own size, so move the slope over accordingly.
> > > > > +	 */
> > > > > +	if (unlikely(bdi_thresh > thresh))
> > > > > +		bdi_thresh = thresh;
> > > > > +	/*
> > > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > > +	 */
> > > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > > +	/*
> > > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > > +	 */
> > > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > > +		       thresh + 1);
> > > >   I think you can slightly simplify this to:
> > > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > > 
> > > Good idea!
> > > 
> > > > > +	x_intercept = setpoint + 2 * span;
> >    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> > ~3*bdi_thresh...
> 
> Right.
> 
> > So maybe you should use bdi_thresh/2 in the computation of span?
> 
> Given that at some configurations bdi_thresh can fluctuate to its own
> size, I guess the current slope of control line is sharp enough.
> 
> Given equations
> 
>         span = (x_intercept - bdi_setpoint) / 2
>         k = df/dx = -0.5 / span
> 
> and the values
> 
>         span = bdi_thresh
>         dx = bdi_thresh
> 
> we get
> 
>         df = - dx / (2 * span) = - 1/2
> 
> That means, when bdi_dirty moves by bdi_thresh, pos_ratio and hence
> task ratelimit will change by -1/2. This is probably more than the
> users can tolerate already?
  OK, let's try that.
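
For anyone following the arithmetic, here is a minimal userspace sketch of
the main bdi control line with x_intercept = bdi_setpoint + 2 * span. It is
not part of the patch: bdi_line_scale() and CALC_SHIFT are hypothetical
names, and the "+ 1" in the patch's divisor is dropped (span > 0 is asserted
instead) so that the numbers come out exact.

```c
#include <assert.h>
#include <stdint.h>

/* Same role as RATELIMIT_CALC_SHIFT in the patch: 1.0 == 1 << 10. */
#define CALC_SHIFT 10

/*
 * Main bdi control line only (no auxiliary line):
 *
 *         x_intercept - bdi_dirty
 *   f = ---------------------------,  x_intercept = bdi_setpoint + 2 * span
 *        x_intercept - bdi_setpoint
 *
 * Returns f in 1/1024 units, clamped to 0 at and beyond x_intercept.
 */
static uint64_t bdi_line_scale(uint64_t bdi_setpoint, uint64_t span,
			       uint64_t bdi_dirty)
{
	uint64_t x_intercept = bdi_setpoint + 2 * span;

	assert(span > 0);	/* keeps the divisor non-zero */
	if (bdi_dirty >= x_intercept)
		return 0;
	return ((x_intercept - bdi_dirty) << CALC_SHIFT) /
	       (x_intercept - bdi_setpoint);
}
```

With bdi_setpoint = span = 1000 pages, the scale is 1024 (1.0) at
bdi_dirty = 1000 and 512 (0.5) at bdi_dirty = 2000, i.e. moving by one span
costs exactly half of the rate, matching df = -1/2 above.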

> ---
> Subject: writeback: dirty position control
> Date: Wed Mar 02 16:04:18 CST 2011
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulting task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> The slope of the bdi control line should be
> 
> 1) large enough to pull the dirty pages to setpoint reasonably fast
> 
> 2) small enough to avoid big fluctuations in the resulting pos_ratio and
>    hence task ratelimit
> 
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within 1-second worth of data, the bdi control line's slope is
> selected to be a linear function of bdi write bandwidth, so that it can
> adapt to slow/fast storage devices well.
> 
> Assume the bdi control line
> 
> 	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
> 
> where k is the negative slope.
> 
> If we target a 12.5% fluctuation range in pos_ratio when dirty pages
> are fluctuating in range
> 
> 	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
> 
> we get slope
> 
> 	k = - 1 / (8 * write_bw)
> 
> Setting pos_ratio(x_intercept) = 0, we get the parameter used in code:
> 
> 	x_intercept = bdi_setpoint + 8 * write_bw
> 
> The global/bdi slopes are nicely complementing each other when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
> 
> 1) slope of global control line    => scaling to the control scope size
> 2) slope of main bdi control line  => scaling to the write bandwidth
> 
> so that
> 
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
>   pages inside the control scope
> 
> - in large memory systems where the "gravity" of (1) for pulling the
>   dirty pages to setpoint is too weak, (2) can back (1) up and drive
>   dirty pages to bdi_setpoint ~= setpoint reasonably fast.
> 
> Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> is related to memory size due to the interference between disks.  In
> this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
> 
> peter: use 3rd order polynomial for the global control line
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  OK, I like this patch now. You can add
Acked-by: Jan Kara <jack@suse.cz>

								Honza
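
To make the "weighted sum" concrete, below is a small userspace sketch of
the simplified span computation as it appears in the patch. The function
name bdi_span() and the sample numbers are mine; write_bw stands in for
bdi->avg_write_bandwidth, and everything is in pages.

```c
#include <assert.h>
#include <stdint.h>

/*
 *        bdi_thresh                  thresh - bdi_thresh
 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
 *          thresh                          thresh
 *
 * computed as (thresh - bdi_thresh + 4*write_bw) * x >> 16, where
 * x = (bdi_thresh << 16) / (thresh + 1) is bdi_thresh/thresh in 16.16
 * fixed point.
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh,
			 uint64_t write_bw)
{
	uint64_t x;

	if (bdi_thresh > thresh)
		bdi_thresh = thresh;
	assert(bdi_thresh < (1ULL << 32));	/* keep the shift in range */

	x = (bdi_thresh << 16) / (thresh + 1);
	return ((thresh - bdi_thresh + 4 * write_bw) * x) >> 16;
}
```

With thresh = 65535: a single bdi (bdi_thresh = 65535, write_bw = 1000)
gives span = 3999, roughly 4 * write_bw; a bdi owning a quarter of the
threshold (bdi_thresh = 16384) with write_bw = 0 gives span = 12287,
roughly 0.75 * bdi_thresh. The off-by-one comes from integer truncation.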

> ---
>  fs/fs-writeback.c         |    2 
>  include/linux/writeback.h |    1 
>  mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 209 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define RATELIMIT_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
>  	return x + 1;	/* Ensure that we never return 0 */
>  }
>  
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> +					   unsigned long bg_thresh)
> +{
> +	return (thresh + bg_thresh) / 2;
> +}
> +
>  static unsigned long hard_dirty_limit(unsigned long thresh)
>  {
>  	return max(thresh, global_dirty_limit);
> @@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |<-- span --->| .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* dirty pages' target balance point */
> +	unsigned long bdi_setpoint;
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                           setpoint - dirty 3
> +	 *        f(dirty) := 1.0 + (----------------)
> +	 *                           limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx      <= 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * We have computed basic pos_ratio above based on global situation. If
> +	 * the bdi is over/under its share of dirty pages, we want to scale
> +	 * pos_ratio further down/up. That is done by the following policies:
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +	 * for various filesystems, so choose a slope that can yield in a
> +	 * reasonable 12.5% fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly and choose a slope that
> +	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
> +	 */
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +	 *
> +	 *                        x_intercept - bdi_dirty
> +	 *                     := --------------------------
> +	 *                        x_intercept - bdi_setpoint
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(bdi_setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:
> +	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +	bdi_setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 *
> +	 *        bdi_thresh                  thresh - bdi_thresh
> +	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
> +	 *          thresh                          thresh
> +	 */
> +	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
> +								(u64)x >> 16;
> +	x_intercept = bdi_setpoint + 2 * span;
> +
> +	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			bdi_setpoint += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> @@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
> @@ -629,6 +828,7 @@ snapshot:
>  
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>  				 unsigned long thresh,
> +				 unsigned long bg_thresh,
>  				 unsigned long dirty,
>  				 unsigned long bdi_thresh,
>  				 unsigned long bdi_dirty,
> @@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
>  	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>  		return;
>  	spin_lock(&bdi->wb.list_lock);
> -	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -			       start_time);
> +	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +			       bdi_thresh, bdi_dirty, start_time);
>  	spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +						      background_thresh))
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> -		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -				     bdi_thresh, bdi_dirty, start_time);
> +		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +				     nr_dirty, bdi_thresh, bdi_dirty,
> +				     start_time);
>  
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
> @@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>  				unsigned long start_time)
>  {
> -	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>  
>  /*
> --- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
> @@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-18 19:16             ` Jan Kara
  0 siblings, 0 replies; 169+ messages in thread
From: Jan Kara @ 2011-08-18 19:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Thu 18-08-11 12:18:01, Wu Fengguang wrote:
> > > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > > +	 */
> > > > > +	setpoint = (freerun + limit) / 2;
> > > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > > +		    limit - setpoint + 1);
> > > > > +	pos_ratio = x;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > > +
> > > > > +	/*
> > > > > +	 * bdi setpoint
> >   OK, so if I understand the code right, we now have basic pos_ratio based
> > on global situation. Now, in the following code, we might scale pos_ratio
> > further down, if bdi_dirty is too much over bdi's share, right?
> 
> Right.
> 
> > Do we also want to scale pos_ratio up, if we are under bdi's share?
> 
> Yes.
> 
> > If yes, do we really want to do it even if global pos_ratio < 1
> > (i.e. we are over global setpoint)?
> 
> Yes. It's safe because the bdi pos_ratio scale is linear and the
> global pos_ratio scale will quickly drop to 0 near @limit, thus
> counter-acting any > 1 bdi pos_ratio.
  OK. I just wanted to make sure I understand it right :-). I can see
arguments for all the different choices so let's see how it works in
practice...
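
To convince myself of the counter-acting effect, a rough userspace sketch of
the global cubic followed by a bdi scale-up. This is an illustration, not
the kernel code: global_pos_ratio() and apply_bdi_scale() are made-up names,
and I use multiplication/division instead of shifts so that negative
intermediate values stay well-defined in plain C (rounding therefore differs
slightly from the patch).

```c
#include <assert.h>
#include <stdint.h>

#define CALC_SHIFT 10			/* 1.0 == 1 << 10 */
#define ONE ((int64_t)1 << CALC_SHIFT)

/*
 *                     setpoint - dirty 3
 *   f(dirty) = 1.0 + (----------------)     setpoint = (freerun + limit) / 2
 *                     limit - setpoint
 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	int64_t setpoint = (freerun + limit) / 2;
	int64_t x, r;

	assert(limit > freerun);
	if (dirty >= limit)
		return 0;

	x = (setpoint - dirty) * ONE / (limit - setpoint + 1);
	r = x;
	r = r * x / ONE;
	r = r * x / ONE;
	return r + ONE;
}

/* Apply a bdi scale (also in 1/1024 units) on top of the global ratio. */
static int64_t apply_bdi_scale(int64_t global_ratio, int64_t bdi_scale)
{
	return global_ratio * bdi_scale / ONE;
}
```

With freerun = 1000 and limit = 3000, the global ratio is 1024 at the
setpoint (2000), close to 2048 at freerun, and only a handful of units at
dirty = 2999. Scaling that last value up by 1.5 (bdi_scale = 1536) still
leaves the result a small fraction of 1.0, so a bdi under its share cannot
push the task rate up much when we are near @limit.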

> > > > > +	 *
> > > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> >                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> > bdi_setpoint to distinguish clearly from the global value.
> 
> OK. I'll add a new variable bdi_setpoint, too, to make it consistent
> all over the places.
> 
> > > > > +	 *
> > > > > +	 * The main bdi control line is a linear function that subjects to
> > > > > +	 *
> > > > > +	 * (1) f(setpoint) = 1.0
> > > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > > +	 *
> > > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > > +	 * regularly within range
> > > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > > +	 * fluctuation range for pos_ratio.
> > > > > +	 *
> > > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > > +	 * own size, so move the slope over accordingly.
> > > > > +	 */
> > > > > +	if (unlikely(bdi_thresh > thresh))
> > > > > +		bdi_thresh = thresh;
> > > > > +	/*
> > > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > > +	 */
> > > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > > +	/*
> > > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > > +	 */
> > > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > > +		       thresh + 1);
> > > >   I think you can slightly simplify this to:
> > > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > > 
> > > Good idea!
> > > 
> > > > > +	x_intercept = setpoint + 2 * span;
> >    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> > ~3*bdi_thresh...
> 
> Right.
> 
> > So maybe you should use bdi_thresh/2 in the computation of span?
> 
> Given that at some configurations bdi_thresh can fluctuate to its own
> size, I guess the current slope of control line is sharp enough.
> 
> Given equations
> 
>         span = (x_intercept - bdi_setpoint) / 2
>         k = df/dx = -0.5 / span
> 
> and the values
> 
>         span = bdi_thresh
>         dx = bdi_thresh
> 
> we get
> 
>         df = - dx / (2 * span) = - 1/2
> 
> That means, when bdi_dirty moves by bdi_thresh, pos_ratio and hence
> task ratelimit will change by -1/2. This is probably more than the
> users can tolerate already?
  OK, let's try that.

> ---
> Subject: writeback: dirty position control
> Date: Wed Mar 02 16:04:18 CST 2011
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulting task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> The slope of the bdi control line should be
> 
> 1) large enough to pull the dirty pages to setpoint reasonably fast
> 
> 2) small enough to avoid big fluctuations in the resulting pos_ratio and
>    hence task ratelimit
> 
> Since the fluctuation range of the bdi dirty pages is typically observed
> to be within 1-second worth of data, the bdi control line's slope is
> selected to be a linear function of bdi write bandwidth, so that it can
> adapt to slow/fast storage devices well.
> 
> Assume the bdi control line
> 
> 	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)
> 
> where k is the negative slope.
> 
> If we target a 12.5% fluctuation range in pos_ratio when dirty pages
> are fluctuating in range
> 
> 	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],
> 
> we get slope
> 
> 	k = - 1 / (8 * write_bw)
> 
> Setting pos_ratio(x_intercept) = 0, we get the parameter used in code:
> 
> 	x_intercept = bdi_setpoint + 8 * write_bw
> 
> The global/bdi slopes are nicely complementing each other when the
> system has only one major bdi (indicated by bdi_thresh ~= thresh):
> 
> 1) slope of global control line    => scaling to the control scope size
> 2) slope of main bdi control line  => scaling to the write bandwidth
> 
> so that
> 
> - in memory tight systems, (1) becomes strong enough to squeeze dirty
>   pages inside the control scope
> 
> - in large memory systems where the "gravity" of (1) for pulling the
>   dirty pages to setpoint is too weak, (2) can back (1) up and drive
>   dirty pages to bdi_setpoint ~= setpoint reasonably fast.
> 
> Unfortunately in JBOD setups, the fluctuation range of bdi threshold
> is related to memory size due to the interference between disks.  In
> this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
> 
> peter: use 3rd order polynomial for the global control line
> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  OK, I like this patch now. You can add
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fs-writeback.c         |    2 
>  include/linux/writeback.h |    1 
>  mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
>  3 files changed, 209 insertions(+), 6 deletions(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define RATELIMIT_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
>  	return x + 1;	/* Ensure that we never return 0 */
>  }
>  
> +static unsigned long dirty_freerun_ceiling(unsigned long thresh,
> +					   unsigned long bg_thresh)
> +{
> +	return (thresh + bg_thresh) / 2;
> +}
> +
>  static unsigned long hard_dirty_limit(unsigned long thresh)
>  {
>  	return max(thresh, global_dirty_limit);
> @@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + * We want the dirty pages be balanced around the global/bdi setpoints.
> + * When the number of dirty pages is higher/lower than the setpoint, the
> + * dirty position control ratio (and hence task dirty ratelimit) will be
> + * decreased/increased to bring the dirty pages back to the setpoint.
> + *
> + *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
> + *
> + *     if (dirty < setpoint) scale up   pos_ratio
> + *     if (dirty > setpoint) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
> + *
> + *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
> + *
> + * (o) global control line
> + *
> + *     ^ pos_ratio
> + *     |
> + *     |            |<===== global dirty control scope ======>|
> + * 2.0 .............*
> + *     |            .*
> + *     |            . *
> + *     |            .   *
> + *     |            .     *
> + *     |            .        *
> + *     |            .            *
> + * 1.0 ................................*
> + *     |            .                  .     *
> + *     |            .                  .          *
> + *     |            .                  .              *
> + *     |            .                  .                 *
> + *     |            .                  .                    *
> + *   0 +------------.------------------.----------------------*------------->
> + *           freerun^          setpoint^                 limit^   dirty pages
> + *
> + * (o) bdi control lines
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * The below figure illustrates the main bdi control line with an auxiliary
> + * line extending it to @limit.
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, rate scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, rate scale = 1/2
> + *                       |<-- span --->| .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0              bdi_setpoint                    x_intercept           limit
> + *
> + * The auxiliary control line allows smoothly throttling bdi_dirty down to
> + * normal if it starts high in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
> + * - the bdi dirty thresh drops quickly due to change of JBOD workload
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* dirty pages' target balance point */
> +	unsigned long bdi_setpoint;
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                           setpoint - dirty 3
> +	 *        f(dirty) := 1.0 + (----------------)
> +	 *                           limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx      <= 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * We have computed basic pos_ratio above based on global situation. If
> +	 * the bdi is over/under its share of dirty pages, we want to scale
> +	 * pos_ratio further down/up. That is done by the following policies:
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
> +	 * for various filesystems, so choose a slope that can yield in a
> +	 * reasonable 12.5% fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly and choose a slope that
> +	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
> +	 */
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
> +	 *
> +	 *                        x_intercept - bdi_dirty
> +	 *                     := --------------------------
> +	 *                        x_intercept - bdi_setpoint
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(bdi_setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:
> +	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
> +	bdi_setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 *
> +	 *        bdi_thresh                  thresh - bdi_thresh
> +	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
> +	 *          thresh                          thresh
> +	 */
> +	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
> +								(u64)x >> 16;
> +	x_intercept = bdi_setpoint + 2 * span;
> +
> +	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			bdi_setpoint += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> @@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
> @@ -629,6 +828,7 @@ snapshot:
>  
>  static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>  				 unsigned long thresh,
> +				 unsigned long bg_thresh,
>  				 unsigned long dirty,
>  				 unsigned long bdi_thresh,
>  				 unsigned long bdi_dirty,
> @@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
>  	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>  		return;
>  	spin_lock(&bdi->wb.list_lock);
> -	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
> -			       start_time);
> +	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> +			       bdi_thresh, bdi_dirty, start_time);
>  	spin_unlock(&bdi->wb.list_lock);
>  }
>  
> @@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
>  		 * catch-up. This avoids (excessively) small writeouts
>  		 * when the bdi limits are ramping up.
>  		 */
> -		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
> +		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
> +						      background_thresh))
>  			break;
>  
>  		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
> @@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> -		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
> -				     bdi_thresh, bdi_dirty, start_time);
> +		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> +				     nr_dirty, bdi_thresh, bdi_dirty,
> +				     start_time);
>  
>  		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
>  		 * Unstable writes are a feature of certain networked
> --- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
> @@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
>  static void wb_update_bandwidth(struct bdi_writeback *wb,
>  				unsigned long start_time)
>  {
> -	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
> +	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
>  }
>  
>  /*
> --- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
> +++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
> @@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
>  
>  void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>  			    unsigned long thresh,
> +			    unsigned long bg_thresh,
>  			    unsigned long dirty,
>  			    unsigned long bdi_thresh,
>  			    unsigned long bdi_dirty,
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-18  4:18           ` Wu Fengguang
@ 2011-08-18  4:41             ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

Hi Jan,

> > > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > > easily 500 MB, that happens quite often I imagine?
> > > 
> > > That's fine because I no longer target "bdi_thresh" as some limiting
> > > factor as the global "thresh". Due to it being unstable in small
> > > memory JBOD systems, which is the big and unique problem in JBOD.
> >   I see. Given the control mechanism below, I think we can try this idea
> > and see whether it makes problems in practice or not. But the fact that
> > bdi_thresh is no longer treated as limit should be noted in a changelog -
> > probably of the last patch (although that is already too long for my taste
> > so I'll look into how we could make it shorter so that average developer
> > has enough patience to read it ;).
> 
> Good point. I'll make it a comment in the last patch.

Just added this comment:

+               /*
+                * bdi_thresh is not treated as a limiting factor the way
+                * dirty_thresh is, for two reasons:
+                * - in JBOD setup, bdi_thresh can fluctuate a lot
+                * - in a system with HDD and USB key, the USB key may somehow
+                *   go into state (bdi_dirty >> bdi_thresh) either because
+                *   bdi_dirty starts high, or because bdi_thresh drops low.
+                *   In this case we don't want to hard throttle the USB key
+                *   dirtiers for 100 seconds until bdi_dirty drops under
+                *   bdi_thresh. Instead the auxiliary bdi control line in
+                *   bdi_position_ratio() will let the dirtier task progress
+                *   at some rate <= (write_bw / 2) for bringing down bdi_dirty.
+                */
                bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 20:24         ` Jan Kara
@ 2011-08-18  4:18           ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 18, 2011 at 04:24:14AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> > On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > > +					unsigned long thresh,
> > > > +					unsigned long bg_thresh,
> > > > +					unsigned long dirty,
> > > > +					unsigned long bdi_thresh,
> > > > +					unsigned long bdi_dirty)
> > > > +{
> > > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > > +	unsigned long x_intercept;
> > > > +	unsigned long setpoint;		/* the target balance point */
> > > > +	unsigned long span;
> > > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > > +	long x;
> > > > +
> > > > +	if (unlikely(dirty >= limit))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * global setpoint
> > > > +	 *
> > > > +	 *                         setpoint - dirty 3
> > > > +	 *        f(dirty) := 1 + (----------------)
> > > > +	 *                         limit - setpoint
> > > > +	 *
> > > > +	 * it's a 3rd order polynomial that subjects to
> > > > +	 *
> > > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > > +	 * (3) f(limit)    = 0   => the hard limit
> > > > +	 * (4) df/dx       < 0	 => negative feedback control
>                           ^^^ Strictly speaking this is <= 0

Ah yes, it can be 0 right at the setpoint. 

> > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > +	 */
> > > > +	setpoint = (freerun + limit) / 2;
> > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > +		    limit - setpoint + 1);
> > > > +	pos_ratio = x;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > +
> > > > +	/*
> > > > +	 * bdi setpoint
>   OK, so if I understand the code right, we now have basic pos_ratio based
> on global situation. Now, in the following code, we might scale pos_ratio
> further down, if bdi_dirty is too much over bdi's share, right?

Right.

> Do we also want to scale pos_ratio up, if we are under bdi's share?

Yes.

> If yes, do we really want to do it even if global pos_ratio < 1
> (i.e. we are over global setpoint)?

Yes. It's safe because the bdi pos_ratio scale is linear and the
global pos_ratio scale will quickly drop to 0 near @limit, thus
counter-acting any > 1 bdi pos_ratio.
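
To illustrate how the global scale collapses toward 0 near @limit, here is a
minimal userspace sketch of the cubic control line. The function name
global_pos_ratio() and the CALC_SHIFT macro are hypothetical stand-ins (the
latter assumed to mirror RATELIMIT_CALC_SHIFT); the sketch uses signed 64-bit
arithmetic and division instead of the kernel's div_s64() and shifts, so
rounding may differ slightly from the patch.

```c
#include <stdint.h>

#define CALC_SHIFT 10	/* assumed to mirror RATELIMIT_CALC_SHIFT */

/*
 * Hypothetical userspace model of the global control line:
 *
 *	f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3
 *
 * Returns the scale in 1/1024 units (1024 == 1.0).
 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	int64_t setpoint = (freerun + limit) / 2;
	int64_t one = 1 << CALC_SHIFT;
	int64_t x, pos_ratio;

	if (dirty >= limit)
		return 0;

	/* x is (setpoint - dirty) / (limit - setpoint) in fixed point */
	x = (setpoint - dirty) * one / (limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x / one;	/* x^2 */
	pos_ratio = pos_ratio * x / one;	/* x^3 */
	pos_ratio += one;
	return pos_ratio;
}
```

With freerun = 1000 and limit = 3000 pages, the result is 1024 at the
setpoint (2000), close to 2048 at freerun, and only a few units above 0 one
page below the limit, which is what bounds any > 1 bdi factor there.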

> Maybe we could update the comment with something like:
>  * We have computed basic pos_ratio above based on global situation. If the
>  * bdi is over its share of dirty pages, we want to scale pos_ratio further
>  * down. That is done by the following mechanism:
> and now describe how updating works.

OK.

> > > > +	 *
> > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
>                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> bdi_setpoint to distinguish clearly from the global value.

OK. I'll add a new variable bdi_setpoint, too, so that the naming is
consistent everywhere.

> > > > +	 *
> > > > +	 * The main bdi control line is a linear function that subjects to
> > > > +	 *
> > > > +	 * (1) f(setpoint) = 1.0
> > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > +	 *
> > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > +	 * regularly within range
> > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > +	 * fluctuation range for pos_ratio.
> > > > +	 *
> > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > +	 * own size, so move the slope over accordingly.
> > > > +	 */
> > > > +	if (unlikely(bdi_thresh > thresh))
> > > > +		bdi_thresh = thresh;
> > > > +	/*
> > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > +	 */
> > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > +	/*
> > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > +	 */
> > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > +		       thresh + 1);
> > >   I think you can slightly simplify this to:
> > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > 
> > Good idea!
> > 
> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh...

Right.

> So maybe you should use bdi_thresh/2 in the computation of span?

Given that in some configurations bdi_thresh can fluctuate by as much as
its own size, I guess the current slope of the control line is steep enough.

Given equations

        span = (x_intercept - bdi_setpoint) / 2
        k = df/dx = -0.5 / span

and the values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / (2 * span) = - 1/2

That means that when bdi_dirty moves a distance of bdi_thresh away from
bdi_setpoint, pos_ratio (and hence the task ratelimit) changes by -1/2.
This is probably already more than users can tolerate?

btw. the connection point of the main/auxiliary control lines is at

        (x_intercept + bdi_setpoint) / 2 

as shown in the graph in the updated patch below.
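
As a rough check of the numbers above, here is a userspace sketch of the bdi
scaling step with the main and auxiliary lines. The name bdi_scale() and its
argument list are hypothetical; it follows the tail of bdi_position_ratio()
but takes bdi_setpoint and span as inputs and uses plain signed 64-bit
division instead of do_div().

```c
#include <stdint.h>

/*
 * Hypothetical userspace model of the bdi part of bdi_position_ratio().
 * pos_ratio is the global scale in 1/1024 units; the other arguments
 * are page counts.  Returns the combined scale in 1/1024 units.
 */
static int64_t bdi_scale(int64_t pos_ratio, int64_t bdi_setpoint,
			 int64_t span, int64_t limit, int64_t bdi_dirty)
{
	int64_t x_intercept = bdi_setpoint + 2 * span;

	if (bdi_dirty > bdi_setpoint + span) {
		if (bdi_dirty > limit)
			return 0;
		if (x_intercept < limit) {
			/* auxiliary line: from the connect point down to limit */
			x_intercept = limit;
			bdi_setpoint += span;
			pos_ratio /= 2;
		}
	}
	/* linear interpolation; +1 avoids dividing by zero, as in the patch */
	return pos_ratio * (x_intercept - bdi_dirty) /
	       (x_intercept - bdi_setpoint + 1);
}
```

With bdi_setpoint = 1000, span = 500, limit = 10000 and a global pos_ratio of
1024, the result is about 1024 at the setpoint, about 512 (rate scale 1/2) at
the connect point bdi_setpoint + span, and 0 at the limit.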

> > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > easily 500 MB, that happens quite often I imagine?
> > 
> > That's fine because I no longer target "bdi_thresh" as some limiting
> > factor as the global "thresh". Due to it being unstable in small
> > memory JBOD systems, which is the big and unique problem in JBOD.
>   I see. Given the control mechanism below, I think we can try this idea
> and see whether it makes problems in practice or not. But the fact that
> bdi_thresh is no longer treated as limit should be noted in a changelog -
> probably of the last patch (although that is already too long for my taste
> so I'll look into how we could make it shorter so that average developer
> has enough patience to read it ;).

Good point. I'll make it a comment in the last patch.

Thanks,
Fengguang
---
Subject: writeback: dirty position control
Date: Wed Mar 02 16:04:18 CST 2011

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If we target a 12.5% fluctuation range in pos_ratio when dirty pages
fluctuate in the range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Setting pos_ratio(x_intercept) = 0 gives the parameter used in the code:

	x_intercept = bdi_setpoint + 8 * write_bw
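
The 12.5% figure can be checked with a small userspace sketch. The helper
name main_line_ratio() is hypothetical, and the example values (a write
bandwidth of 25600 pages/s and a bdi setpoint of 100000 pages) are assumptions
chosen only to make the arithmetic exact.

```c
#include <stdint.h>

/*
 * Hypothetical helper: main bdi control line in 1/1024 units, with
 * x_intercept = bdi_setpoint + 8 * write_bw (single bdi case).
 * Assumes write_bw > 0 and bdi_dirty <= x_intercept.
 */
static int64_t main_line_ratio(int64_t bdi_setpoint, int64_t write_bw,
			       int64_t bdi_dirty)
{
	int64_t x_intercept = bdi_setpoint + 8 * write_bw;

	return 1024 * (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint);
}
```

For the assumed values, bdi_dirty = bdi_setpoint - write_bw/2 gives 1088 and
bdi_dirty = bdi_setpoint + write_bw/2 gives 960. The difference of 128 is
1/8 of 1024, i.e. the 12.5% range.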

The global/bdi slopes complement each other nicely when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately, in JBOD setups the fluctuation range of the bdi threshold
is related to memory size due to interference between disks.  In
this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
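
A minimal userspace sketch of that weighted sum, mirroring the span
computation in the patch. The function name bdi_span() is hypothetical, and
plain 64-bit division stands in for div_u64().

```c
#include <stdint.h>

/*
 * Hypothetical userspace model of the span computation:
 *
 *	span = bdi_thresh/thresh * (4 * write_bw) +
 *	       (thresh - bdi_thresh)/thresh * bdi_thresh
 *
 * All arguments are page counts (write_bw in pages per second).
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh,
			 uint64_t write_bw)
{
	uint64_t x;

	if (bdi_thresh > thresh)
		bdi_thresh = thresh;

	/* x = bdi_thresh / thresh as a 16-bit fixed-point fraction */
	x = (bdi_thresh << 16) / (thresh + 1);

	return ((thresh - bdi_thresh + 4 * write_bw) * x) >> 16;
}
```

With thresh = bdi_thresh = 100000 and write_bw = 25600 (single bdi), the span
is just under 4 * write_bw = 102400. With bdi_thresh = 10000 (one disk out of
many), the span drops to roughly 19240, dominated by the bdi_thresh term.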

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
 3 files changed, 209 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |<-- span --->| .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following policies:
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, so choose a slope that can yield in a
+	 * reasonable 12.5% fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                  thresh - bdi_thresh
+	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
+	 *          thresh                          thresh
+	 */
+	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
+								(u64)x >> 16;
+	x_intercept = bdi_setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +828,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-18  4:18           ` Wu Fengguang
  0 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-18  4:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 18, 2011 at 04:24:14AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> > On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > > +					unsigned long thresh,
> > > > +					unsigned long bg_thresh,
> > > > +					unsigned long dirty,
> > > > +					unsigned long bdi_thresh,
> > > > +					unsigned long bdi_dirty)
> > > > +{
> > > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > > +	unsigned long x_intercept;
> > > > +	unsigned long setpoint;		/* the target balance point */
> > > > +	unsigned long span;
> > > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > > +	long x;
> > > > +
> > > > +	if (unlikely(dirty >= limit))
> > > > +		return 0;
> > > > +
> > > > +	/*
> > > > +	 * global setpoint
> > > > +	 *
> > > > +	 *                         setpoint - dirty 3
> > > > +	 *        f(dirty) := 1 + (----------------)
> > > > +	 *                         limit - setpoint
> > > > +	 *
> > > > +	 * it's a 3rd order polynomial that subjects to
> > > > +	 *
> > > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > > +	 * (3) f(limit)    = 0   => the hard limit
> > > > +	 * (4) df/dx       < 0	 => negative feedback control
>                           ^^^ Strictly speaking this is <= 0

Ah yes, it can be 0 right at the setpoint. 

> > > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > > +	 */
> > > > +	setpoint = (freerun + limit) / 2;
> > > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > > +		    limit - setpoint + 1);
> > > > +	pos_ratio = x;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > > +
> > > > +	/*
> > > > +	 * bdi setpoint
>   OK, so if I understand the code right, we now have basic pos_ratio based
> on global situation. Now, in the following code, we might scale pos_ratio
> further down, if bdi_dirty is too much over bdi's share, right?

Right.

> Do we also want to scale pos_ratio up, if we are under bdi's share?

Yes.

> If yes, do we really want to do it even if global pos_ratio < 1
> (i.e. we are over global setpoint)?

Yes. It's safe because the bdi pos_ratio scale is linear and the
global pos_ratio scale will quickly drop to 0 near @limit, thus
counter-acting any > 1 bdi pos_ratio.
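
To illustrate how fast the global scale falls off near @limit, here is a
minimal userspace sketch of the global control line quoted above. It is not
the kernel code: the name global_pos_ratio() is made up for this example, and
it uses multiplication/division instead of shifts so that negative
intermediate values stay well defined in portable C, which means rounding can
differ slightly from the kernel's ">>".

```c
#include <assert.h>

#define RATELIMIT_CALC_SHIFT	10

/*
 * Hypothetical userspace helper (not part of the patch):
 *
 *   f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3
 *
 * evaluated in fixed point, where 1 << RATELIMIT_CALC_SHIFT means 1.0.
 */
static long long global_pos_ratio(long freerun, long limit, long dirty)
{
	const long long one = 1LL << RATELIMIT_CALC_SHIFT;
	long setpoint = (freerun + limit) / 2;
	long long x, pos_ratio;

	assert(limit > freerun);	/* assumption of this sketch */

	if (dirty >= limit)
		return 0;

	/* x = (setpoint - dirty) / (limit - setpoint), scaled by "one" */
	x = (long long)(setpoint - dirty) * one / (limit - setpoint + 1);

	/* cube x, rescaling after each multiplication */
	pos_ratio = x;
	pos_ratio = pos_ratio * x / one;
	pos_ratio = pos_ratio * x / one;

	return pos_ratio + one;
}
```

With freerun = 1000 and limit = 3000 pages (so setpoint = 2000), this returns
exactly 1024 (1.0) at the setpoint, a value close to 2 * 1024 at freerun, and
only a few units just below the limit, which is the property the argument
above relies on.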

> Maybe we could update the comment with something like:
>  * We have computed basic pos_ratio above based on global situation. If the
>  * bdi is over its share of dirty pages, we want to scale pos_ratio further
>  * down. That is done by the following mechanism:
> and now describe how updating works.

OK.

> > > > +	 *
> > > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
>                   ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
> bdi_setpoint to distinguish clearly from the global value.

OK. I'll also add a new variable bdi_setpoint to keep the naming
consistent everywhere.

> > > > +	 *
> > > > +	 * The main bdi control line is a linear function that subjects to
> > > > +	 *
> > > > +	 * (1) f(setpoint) = 1.0
> > > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > > +	 *
> > > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > > +	 * regularly within range
> > > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > > +	 * fluctuation range for pos_ratio.
> > > > +	 *
> > > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > > +	 * own size, so move the slope over accordingly.
> > > > +	 */
> > > > +	if (unlikely(bdi_thresh > thresh))
> > > > +		bdi_thresh = thresh;
> > > > +	/*
> > > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > > +	 */
> > > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > > +	setpoint = setpoint * (u64)x >> 16;
> > > > +	/*
> > > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > > +	 */
> > > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > > +		       thresh + 1);
> > >   I think you can slightly simplify this to:
> > > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> > 
> > Good idea!
> > 
> > > > +	x_intercept = setpoint + 2 * span;
>    ^^ BTW, why do you have 2*span here? It can result in x_intercept being
> ~3*bdi_thresh...

Right.

> So maybe you should use bdi_thresh/2 in the computation of span?

Given that in some configurations bdi_thresh can fluctuate by up to its
own size, I guess the current slope of the control line is steep enough.

Given equations

        span = (x_intercept - bdi_setpoint) / 2
        k = df/dx = -0.5 / span

and the values

        span = bdi_thresh
        dx = bdi_thresh

we get

        df = - dx / (2 * span) = - 1/2

That means that when bdi_dirty deviates from bdi_setpoint by bdi_thresh,
pos_ratio, and hence the task ratelimit, will change by -1/2. This is
probably already more than users can tolerate?

BTW, the connection point of the main/auxiliary control lines is at

        (x_intercept + bdi_setpoint) / 2

as shown in the graph in the updated patch below.
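
The numbers above can be checked with a small floating-point sketch of the
main bdi control line. This is an illustration only: bdi_main_line() is a
hypothetical name, and it leaves out the "+ 1" that the patch adds to the
divisor to avoid dividing by zero.

```c
#include <assert.h>

/*
 * Hypothetical userspace helper (not part of the patch):
 *
 *   f(bdi_dirty) = (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint)
 *
 * with x_intercept = bdi_setpoint + 2 * span, i.e. slope k = -0.5 / span.
 */
static double bdi_main_line(double bdi_setpoint, double span, double bdi_dirty)
{
	double x_intercept = bdi_setpoint + 2.0 * span;

	assert(span > 0.0);	/* assumption of this sketch */

	return (x_intercept - bdi_dirty) / (x_intercept - bdi_setpoint);
}
```

For bdi_setpoint = 1000 and span = 400, the line gives 1.0 at 1000, 0.5 at
1400 (the connection point, bdi_setpoint + span) and 0 at 1800 (x_intercept).
Moving bdi_dirty by one span therefore changes the scale by -1/2, matching
df = -dx / (2 * span).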

> > >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > > easily 500 MB, that happens quite often I imagine?
> > 
> > That's fine because I no longer target "bdi_thresh" as some limiting
> > factor as the global "thresh". Due to it being unstable in small
> > memory JBOD systems, which is the big and unique problem in JBOD.
>   I see. Given the control mechanism below, I think we can try this idea
> and see whether it makes problems in practice or not. But the fact that
> bdi_thresh is no longer treated as limit should be noted in a changelog -
> probably of the last patch (although that is already too long for my taste
> so I'll look into how we could make it shorter so that average developer
> has enough patience to read it ;).

Good point. I'll make it a comment in the last patch.

Thanks,
Fengguang
---
Subject: writeback: dirty position control
Date: Wed Mar 02 16:04:18 CST 2011

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - bdi_setpoint)

where k is the negative slope.

If we target a 12.5% fluctuation range in pos_ratio when dirty pages
are fluctuating in the range

	[bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2],

we get slope

	k = - 1 / (8 * write_bw)

Letting pos_ratio(x_intercept) = 0, we get the parameter used in the code:

	x_intercept = bdi_setpoint + 8 * write_bw

The global/bdi slopes complement each other nicely when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to bdi_setpoint ~= setpoint reasonably fast.

Unfortunately, in JBOD setups the fluctuation range of the bdi threshold
is related to memory size due to the interference between disks.  In
this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
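
As a rough illustration of that weighted sum, the span computation can be
tried out in userspace. This sketch follows the fixed-point expression used
in the patch below, but bdi_span() is a hypothetical name, and it asserts
bdi_thresh <= thresh where the patch clamps bdi_thresh instead.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace helper (not part of the patch):
 *
 *        bdi_thresh                  thresh - bdi_thresh
 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
 *          thresh                          thresh
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh,
			 uint64_t write_bw)
{
	uint64_t x;

	assert(bdi_thresh <= thresh);	/* assumption of this sketch */

	/* x = bdi_thresh / thresh as a 16-bit fixed-point fraction */
	x = (bdi_thresh << 16) / (thresh + 1);

	return ((thresh - bdi_thresh + 4 * write_bw) * x) >> 16;
}
```

With thresh = bdi_thresh = 10000 and write_bw = 100 (single bdi), the result
is just under 4 * write_bw = 400. With bdi_thresh = 1000 (one disk out of
several), the result is close to 0.1 * 400 + 0.9 * 1000 = 940, so the
bdi_thresh term dominates.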

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  212 +++++++++++++++++++++++++++++++++++-
 3 files changed, 209 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-18 12:15:24.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,196 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |<-- span --->| .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0              bdi_setpoint                    x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi_setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* dirty pages' target balance point */
+	unsigned long bdi_setpoint;
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                           setpoint - dirty 3
+	 *        f(dirty) := 1.0 + (----------------)
+	 *                           limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx      <= 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * We have computed basic pos_ratio above based on global situation. If
+	 * the bdi is over/under its share of dirty pages, we want to scale
+	 * pos_ratio further down/up. That is done by the following policies:
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2]
+	 * for various filesystems, so choose a slope that can yield in a
+	 * reasonable 12.5% fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly and choose a slope that
+	 * yields 50% pos_ratio fluctuation when bdi_thresh is suddenly doubled.
+	 */
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint)
+	 *
+	 *                        x_intercept - bdi_dirty
+	 *                     := --------------------------
+	 *                        x_intercept - bdi_setpoint
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(bdi_setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = bdi_setpoint + 8 * write_bw
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:
+	 * 	bdi_setpoint = setpoint * bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh + 1);
+	bdi_setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 *
+	 *        bdi_thresh                  thresh - bdi_thresh
+	 * span = ---------- * (4*write_bw) + ------------------- * bdi_thresh
+	 *          thresh                          thresh
+	 */
+	span = (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) *
+								(u64)x >> 16;
+	x_intercept = bdi_setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > bdi_setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			bdi_setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - bdi_setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +791,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +828,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +837,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +879,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +924,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-17 20:35:34.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-17 20:35:22.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-17 20:35:34.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 13:23     ` Wu Fengguang
@ 2011-08-17 20:24         ` Jan Kara
  2011-08-17 20:24         ` Jan Kara
  1 sibling, 0 replies; 169+ messages in thread
From: Jan Kara @ 2011-08-17 20:24 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Peter Zijlstra, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

  Hi Fengguang,

On Wed 17-08-11 21:23:47, Wu Fengguang wrote:
> On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
> > > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > > +					unsigned long thresh,
> > > +					unsigned long bg_thresh,
> > > +					unsigned long dirty,
> > > +					unsigned long bdi_thresh,
> > > +					unsigned long bdi_dirty)
> > > +{
> > > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > > +	unsigned long limit = hard_dirty_limit(thresh);
> > > +	unsigned long x_intercept;
> > > +	unsigned long setpoint;		/* the target balance point */
> > > +	unsigned long span;
> > > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > > +	long x;
> > > +
> > > +	if (unlikely(dirty >= limit))
> > > +		return 0;
> > > +
> > > +	/*
> > > +	 * global setpoint
> > > +	 *
> > > +	 *                         setpoint - dirty 3
> > > +	 *        f(dirty) := 1 + (----------------)
> > > +	 *                         limit - setpoint
> > > +	 *
> > > +	 * it's a 3rd order polynomial that subjects to
> > > +	 *
> > > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > > +	 * (2) f(setpoint) = 1.0 => the balance point
> > > +	 * (3) f(limit)    = 0   => the hard limit
> > > +	 * (4) df/dx       < 0	 => negative feedback control
                          ^^^ Strictly speaking this is <= 0

> > > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > > +	 *     => fast response on large errors; small oscillation near setpoint
> > > +	 */
> > > +	setpoint = (freerun + limit) / 2;
> > > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > > +		    limit - setpoint + 1);
> > > +	pos_ratio = x;
> > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > > +
> > > +	/*
> > > +	 * bdi setpoint
  OK, so if I understand the code right, we now have basic pos_ratio based
on global situation. Now, in the following code, we might scale pos_ratio
further down, if bdi_dirty is too much over bdi's share, right? Do we also
want to scale pos_ratio up, if we are under bdi's share? If yes, do we
really want to do it even if global pos_ratio < 1 (i.e. we are over global
setpoint)?

Maybe we could update the comment with something like:
 * We have computed basic pos_ratio above based on global situation. If the
 * bdi is over its share of dirty pages, we want to scale pos_ratio further
 * down. That is done by the following mechanism:
and now describe how updating works.

> > > +	 *
> > > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
                  ^^^^^^^ bdi_dirty?             ^^^ maybe I'd name it
bdi_setpoint to distinguish clearly from the global value.

> > > +	 *
> > > +	 * The main bdi control line is a linear function that subjects to
> > > +	 *
> > > +	 * (1) f(setpoint) = 1.0
> > > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > > +	 *
> > > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > > +	 * regularly within range
> > > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > > +	 * fluctuation range for pos_ratio.
> > > +	 *
> > > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > > +	 * own size, so move the slope over accordingly.
> > > +	 */
> > > +	if (unlikely(bdi_thresh > thresh))
> > > +		bdi_thresh = thresh;
> > > +	/*
> > > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > > +	 */
> > > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > > +	setpoint = setpoint * (u64)x >> 16;
> > > +	/*
> > > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > > +	 */
> > > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > > +		       thresh + 1);
> >   I think you can slightly simplify this to:
> > (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;
> 
> Good idea!
> 
> > > +	x_intercept = setpoint + 2 * span;
   ^^ BTW, why do you have 2*span here? It can result in x_intercept being
~3*bdi_thresh... So maybe you should use bdi_thresh/2 in the computation of
span?

> >   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> > easily 500 MB, that happens quite often I imagine?
> 
> That's fine because I no longer target "bdi_thresh" as some limiting
> factor as the global "thresh". Due to it being unstable in small
> memory JBOD systems, which is the big and unique problem in JBOD.
  I see. Given the control mechanism below, I think we can try this idea
and see whether it makes problems in practice or not. But the fact that
bdi_thresh is no longer treated as limit should be noted in a changelog -
probably of the last patch (although that is already too long for my taste
so I'll look into how we could make it shorter so that average developer
has enough patience to read it ;).

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-17 13:23     ` Wu Fengguang
@ 2011-08-17 13:49         ` Wu Fengguang
  2011-08-17 20:24         ` Jan Kara
  1 sibling, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-17 13:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > > +		if (x_intercept < limit) {
> > > +			x_intercept = limit;	/* auxiliary control line */
> > > +			setpoint += span;
> > > +			pos_ratio >>= 1;
> > > +		}
> >   And here you stretch the control area upto the global dirty limit. I
> > understand you maybe don't want to be really strict and cut control area at
> > bdi_thresh but your choice looks like too benevolent - when you have
> > several active bdi's with different speeds this will effectively erase
> > difference between them, won't it? E.g. with 10 bdi's (x_intercept -
> > bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
> > bdi_dirty really heavily exceeds bdi_thresh.
> 
> Yes the auxiliary control line could be very flat (small slope).
> 
> However it's not normal for the bdi dirty pages to fall into the
> range of auxiliary control line at all. And once it takes effect, 
> the pos_ratio is at most 0.5 (which is the value at the connection
> point with the main bdi control line) which is strong enough to pull
> the dirty pages off the auxiliary bdi control line and into the scope
> of main bdi control line.
> 
> The auxiliary control line is intended for bringing down the bdi_dirty
> of the USB key before 250s (where the "pos bandwidth" line keeps low): 
> 
> http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-46/balance_dirty_pages-pages.png
> 
> After that the bdi_dirty will fluctuate around bdi_thresh and won't
> grow high and step into the scope of the auxiliary control line.

Note that the main/auxiliary bdi control lines won't take effect at
the same time: the main bdi control line works around and under the
bdi setpoint, and the auxiliary line takes over in the higher scope up
to @limit.

In the 1UKEY+1HDD test case, the bdi_dirty of the UKEY rushes up during the
free run stage, when global dirty pages are smaller than (thresh+bg_thresh)/2.

So it will initially be under the control of the auxiliary line. Hence the
dirtier task will progress at 1/4 to 1/2 of the UKEY's write bandwidth.
This will bring down the bdi_dirty reasonably fast while still allowing
the dirtier task to make some progress.

The connection point of the main/auxiliary control lines has pos_ratio=0.5.

After 250 seconds, the main bdi control line takes over, indicated by
the bdi_dirty fluctuating around bdi setpoint and the position rate
(green line) fluctuating around the base ratelimit (blue line).
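
To make the hand-over between the two lines concrete, here is a minimal
user-space sketch of the bdi part of the calculation only. It is not the
kernel code: bdi_line_factor() and SCALE_SHIFT are names made up for
illustration, plain uint64_t stands in for the kernel types, and the global
pos_ratio is assumed to be exactly 1.0 (1024 in fixed point).

```c
#include <assert.h>
#include <stdint.h>

#define SCALE_SHIFT 10	/* assumed fixed point: 1 << 10 == 1024 means 1.0 */

/*
 * Hypothetical helper: the factor the bdi control lines contribute,
 * in fixed point, for a given bdi setpoint, span, limit and bdi_dirty.
 */
static uint64_t bdi_line_factor(uint64_t setpoint, uint64_t span,
				uint64_t limit, uint64_t bdi_dirty)
{
	uint64_t x_intercept = setpoint + 2 * span;
	uint64_t factor = 1ULL << SCALE_SHIFT;

	/* keep the sketch free of 64-bit overflow */
	assert(setpoint <= UINT32_MAX && span <= UINT32_MAX &&
	       limit <= UINT32_MAX);

	if (bdi_dirty > setpoint + span) {
		if (bdi_dirty > limit)
			return 0;
		if (x_intercept < limit) {
			/* auxiliary line: starts at 0.5, reaches 0 at limit */
			x_intercept = limit;
			setpoint += span;
			factor >>= 1;
		}
	}
	/* main line: 1.0 at setpoint, 0.5 at setpoint + span */
	factor *= x_intercept - bdi_dirty;
	return factor / (x_intercept - setpoint + 1);
}
```

With setpoint=1000, span=500 and limit=10000, the factor is close to 1024
at bdi_dirty=1000, close to 512 on both sides of bdi_dirty=1500 (the
connection point), and 0 at bdi_dirty=10000. The "+ 1" in the divisor makes
the results land slightly below the ideal values.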

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16 19:41     ` Jan Kara
  (?)
@ 2011-08-17 13:23     ` Wu Fengguang
  2011-08-17 13:49         ` Wu Fengguang
  2011-08-17 20:24         ` Jan Kara
  -1 siblings, 2 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-17 13:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 6444 bytes --]

Hi Jan,

On Wed, Aug 17, 2011 at 03:41:12AM +0800, Jan Kara wrote:
>   Hello Fengguang,
> 
>   this patch is much easier to read than in older versions! Good work!

Thank you :)

> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +					unsigned long thresh,
> > +					unsigned long bg_thresh,
> > +					unsigned long dirty,
> > +					unsigned long bdi_thresh,
> > +					unsigned long bdi_dirty)
> > +{
> > +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> > +	unsigned long limit = hard_dirty_limit(thresh);
> > +	unsigned long x_intercept;
> > +	unsigned long setpoint;		/* the target balance point */
> > +	unsigned long span;
> > +	long long pos_ratio;		/* for scaling up/down the rate limit */
> > +	long x;
> > +
> > +	if (unlikely(dirty >= limit))
> > +		return 0;
> > +
> > +	/*
> > +	 * global setpoint
> > +	 *
> > +	 *                         setpoint - dirty 3
> > +	 *        f(dirty) := 1 + (----------------)
> > +	 *                         limit - setpoint
> > +	 *
> > +	 * it's a 3rd order polynomial that subjects to
> > +	 *
> > +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> > +	 * (2) f(setpoint) = 1.0 => the balance point
> > +	 * (3) f(limit)    = 0   => the hard limit
> > +	 * (4) df/dx       < 0	 => negative feedback control
> > +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> > +	 *     => fast response on large errors; small oscillation near setpoint
> > +	 */
> > +	setpoint = (freerun + limit) / 2;
> > +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> > +		    limit - setpoint + 1);
> > +	pos_ratio = x;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> > +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> > +
> > +	/*
> > +	 * bdi setpoint
> > +	 *
> > +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> > +	 *
> > +	 * The main bdi control line is a linear function that subjects to
> > +	 *
> > +	 * (1) f(setpoint) = 1.0
> > +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> > +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> > +	 *
> > +	 * For single bdi case, the dirty pages are observed to fluctuate
> > +	 * regularly within range
> > +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> > +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> > +	 * fluctuation range for pos_ratio.
> > +	 *
> > +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> > +	 * own size, so move the slope over accordingly.
> > +	 */
> > +	if (unlikely(bdi_thresh > thresh))
> > +		bdi_thresh = thresh;
> > +	/*
> > +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> > +	 */
> > +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> > +	setpoint = setpoint * (u64)x >> 16;
> > +	/*
> > +	 * Use span=(4*write_bw) in single bdi case as indicated by
> > +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> > +	 */
> > +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> > +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> > +		       thresh + 1);
>   I think you can slightly simplify this to:
> (thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;

Good idea!
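
As a quick sanity check that the two forms agree up to rounding, here is a
small user-space sketch. The names span_original() and span_simplified() are
made up for this mail, write_bw stands for bdi->avg_write_bandwidth, and
plain uint64_t replaces the kernel types and div_u64().

```c
#include <assert.h>
#include <stdint.h>

/* span as computed in the posted patch */
static uint64_t span_original(uint64_t thresh, uint64_t bdi_thresh,
			      uint64_t write_bw)
{
	assert(bdi_thresh <= thresh);	/* the patch clamps bdi_thresh first */
	return (bdi_thresh * (thresh - bdi_thresh) +
		4 * write_bw * bdi_thresh) / (thresh + 1);
}

/* span with the suggested simplification, reusing x = bdi_thresh/thresh */
static uint64_t span_simplified(uint64_t thresh, uint64_t bdi_thresh,
				uint64_t write_bw)
{
	uint64_t x;

	assert(bdi_thresh <= thresh);
	x = (bdi_thresh << 16) / (thresh | 1);	/* 16.16 fixed-point ratio */
	return ((thresh - bdi_thresh + 4 * write_bw) * x) >> 16;
}
```

For thresh=100000, bdi_thresh=25000, write_bw=5000 the two results differ
by one page, which is just the rounding of the 16-bit fixed-point ratio. In
the single bdi case (bdi_thresh == thresh) both come out at about
4 * write_bw, as the comment in the patch says.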

> > +	x_intercept = setpoint + 2 * span;
>   What if x_intercept >  bdi_thresh? Since 8*bdi->avg_write_bandwidth is
> easily 500 MB, that happens quite often I imagine?

That's fine because I no longer treat "bdi_thresh" as a limiting
factor in the way the global "thresh" is. The reason is that bdi_thresh
is unstable in small memory JBOD systems, which is the big and unique
problem in JBOD.

> > +
> > +	if (unlikely(bdi_dirty > setpoint + span)) {
> > +		if (unlikely(bdi_dirty > limit))
> > +			return 0;
>   Shouldn't this be bdi_thresh instead of limit? I understand this is a
> hard limit but with more bdis this condition is rather weak and almost
> never true.

Yeah, I mean @limit. @bdi_thresh is deliberately made weak in IO-less
balance_dirty_pages() in order to get a reasonably smooth dirty rate in
the face of a fluctuating @bdi_thresh.

The tradeoff is to let bdi dirty pages fluctuate more or less freely,
as long as they don't drop low and risk IO queue underflow. The
attached patch tries to prevent the underflow (which is good but not
perfect).
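
For reference, the reserve-area scaling in the attached patch boils down to
the following user-space sketch. reserve_area_boost() is a hypothetical
name, x_intercept is passed in directly instead of being derived from
avg_write_bandwidth and MIN_WRITEBACK_PAGES, and 1024 is assumed to
represent pos_ratio = 1.0.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: scale pos_ratio up as bdi_dirty drops below
 * x_intercept, capped at a factor of 8.
 */
static uint64_t reserve_area_boost(uint64_t pos_ratio, uint64_t x_intercept,
				   uint64_t bdi_dirty)
{
	/* keep the products below within 64 bits */
	assert(pos_ratio <= UINT32_MAX && x_intercept <= UINT32_MAX);

	if (bdi_dirty < x_intercept) {
		if (bdi_dirty > x_intercept / 8)
			/* hyperbola: factor x_intercept / bdi_dirty, 1x..8x */
			pos_ratio = pos_ratio * x_intercept / bdi_dirty;
		else
			/* very few dirty pages left: cap the boost at 8x */
			pos_ratio *= 8;
	}
	return pos_ratio;
}
```

With x_intercept=8000 pages, a pos_ratio of 1024 stays 1024 at
bdi_dirty=8000, doubles to 2048 at bdi_dirty=4000, and saturates at 8192
once bdi_dirty is at or below 1000.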

> > +		if (x_intercept < limit) {
> > +			x_intercept = limit;	/* auxiliary control line */
> > +			setpoint += span;
> > +			pos_ratio >>= 1;
> > +		}
>   And here you stretch the control area up to the global dirty limit. I
> understand you maybe don't want to be really strict and cut control area at
> bdi_thresh but your choice looks too benevolent - when you have
> several active bdi's with different speeds this will effectively erase
> difference between them, won't it? E.g. with 10 bdi's (x_intercept -
> bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
> bdi_dirty really heavily exceeds bdi_thresh.

Yes the auxiliary control line could be very flat (small slope).

However it's not normal for the bdi dirty pages to fall into the
range of the auxiliary control line at all. And once it takes effect,
the pos_ratio is at most 0.5 (which is the value at the connection
point with the main bdi control line), which is strong enough to pull
the dirty pages off the auxiliary bdi control line and into the scope
of the main bdi control line.

The auxiliary control line is intended for bringing down the bdi_dirty
of the USB key before 250s (where the "pos bandwidth" line keeps low): 

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v6/1UKEY+1HDD-3G/ext4-4dd-1M-8p-2945M-20%25-2.6.38-rc5-dt6+-2011-02-22-09-46/balance_dirty_pages-pages.png

After that the bdi_dirty will fluctuate around bdi_thresh and won't
grow high and step into the scope of the auxiliary control line.

> So wouldn't it be better to
> just make sure control area is reasonably large (e.g. at least 16 MB) to
> allow BDI to ramp up its bdi_thresh but don't extend it up to global dirty
> limit?

In order to take bdi_thresh as some semi-strict limit, we first need to
make it very stable. Otherwise the whole control system may fluctuate
violently.

Thanks,
Fengguang

> > +	}
> > +	pos_ratio *= x_intercept - bdi_dirty;
> > +	do_div(pos_ratio, x_intercept - setpoint + 1);
> > +
> > +	return pos_ratio;
> > +}
> > +
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

[-- Attachment #2: bdi-reserve-area --]
[-- Type: text/plain, Size: 2539 bytes --]

Subject: writeback: dirty position control - bdi reserve area
Date: Thu Aug 04 22:16:46 CST 2011

Keep a minimal pool of dirty pages for each bdi, so that the disk IO
queues won't underrun.

It's particularly useful for JBOD and small memory systems.

XXX:
When memory is small (in comparison to write bandwidth), this control
line may result in (pos_ratio > 1) at the setpoint and push the dirty
pages high. This is more or less intended because the bdi is in
danger of IO queue underflow. However the global dirty pages, when
pushed close to limit, will eventually counteract our desire to push up
the low bdi_dirty. In low memory JBOD tests we do see disks
under-utilized from time to time.

One scheme that may completely fix this is to add a BDI_queue_empty to
indicate the block IO queue emptiness (but still there may be in flight
IOs on the driver/hardware side) and to unthrottle the tasks regardless
of the global limit on seeing BDI_queue_empty.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-16 09:06:46.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-16 09:06:50.000000000 +0800
@@ -488,6 +488,16 @@ unsigned long bdi_dirty_limit(struct bac
  *   0 +------------.------------------.----------------------*------------->
  *           freerun^          setpoint^                 limit^   dirty pages
  *
+ * (o) bdi reserve area
+ *
+ * The bdi reserve area tries to keep a reasonable number of dirty pages for
+ * preventing block queue underrun.
+ *
+ * reserve area, scale up rate as dirty pages drop low
+ * |<----------------------------------------------->|
+ * |-------------------------------------------------------*-------|----------
+ * 0                                           bdi setpoint^       ^bdi_thresh
+ *
  * (o) bdi control lines
  *
  * The control lines for the global/bdi setpoints both stretch up to @limit.
@@ -571,6 +581,19 @@ static unsigned long bdi_position_ratio(
 	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
 
 	/*
+	 * bdi reserve area, safeguard against dirty pool underrun and disk idle
+	 */
+	x_intercept = min(bdi->avg_write_bandwidth + 2 * MIN_WRITEBACK_PAGES,
+			  freerun);
+	if (bdi_dirty < x_intercept) {
+		if (bdi_dirty > x_intercept / 8) {
+			pos_ratio *= x_intercept;
+			do_div(pos_ratio, bdi_dirty);
+		} else
+			pos_ratio *= 8;
+	}
+
+	/*
 	 * bdi setpoint
 	 *
 	 *        f(dirty) := 1.0 + k * (dirty - setpoint)

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-16 19:41     ` Jan Kara
  -1 siblings, 0 replies; 169+ messages in thread
From: Jan Kara @ 2011-08-16 19:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Peter Zijlstra, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

  Hello Fengguang,

  this patch is much easier to read than in older versions! Good work!

> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long bg_thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long x_intercept;
> +	unsigned long setpoint;		/* the target balance point */
> +	unsigned long span;
> +	long long pos_ratio;		/* for scaling up/down the rate limit */
> +	long x;
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 *
> +	 *                         setpoint - dirty 3
> +	 *        f(dirty) := 1 + (----------------)
> +	 *                         limit - setpoint
> +	 *
> +	 * it's a 3rd order polynomial that subjects to
> +	 *
> +	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
> +	 * (2) f(setpoint) = 1.0 => the balance point
> +	 * (3) f(limit)    = 0   => the hard limit
> +	 * (4) df/dx       < 0	 => negative feedback control
> +	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> +	 *     => fast response on large errors; small oscillation near setpoint
> +	 */
> +	setpoint = (freerun + limit) / 2;
> +	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
> +		    limit - setpoint + 1);
> +	pos_ratio = x;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> +	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> +
> +	/*
> +	 * bdi setpoint
> +	 *
> +	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
> +	 *
> +	 * The main bdi control line is a linear function that subjects to
> +	 *
> +	 * (1) f(setpoint) = 1.0
> +	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
> +	 *     or equally: x_intercept = setpoint + 8 * write_bw
> +	 *
> +	 * For single bdi case, the dirty pages are observed to fluctuate
> +	 * regularly within range
> +	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
> +	 * for various filesystems, where (2) can yield in a reasonable 12.5%
> +	 * fluctuation range for pos_ratio.
> +	 *
> +	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
> +	 * own size, so move the slope over accordingly.
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	/*
> +	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
> +	 */
> +	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
> +	setpoint = setpoint * (u64)x >> 16;
> +	/*
> +	 * Use span=(4*write_bw) in single bdi case as indicated by
> +	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
> +	 */
> +	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
> +		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
> +		       thresh + 1);
  I think you can slightly simplify this to:
(thresh - bdi_thresh + 4 * bdi->avg_write_bandwidth) * (u64)x >> 16;


> +	x_intercept = setpoint + 2 * span;
  What if x_intercept > bdi_thresh? Since 8*bdi->avg_write_bandwidth is
easily 500 MB, I imagine that happens quite often?

> +
> +	if (unlikely(bdi_dirty > setpoint + span)) {
> +		if (unlikely(bdi_dirty > limit))
> +			return 0;
  Shouldn't this be bdi_thresh instead of limit? I understand this is a
hard limit but with more bdis this condition is rather weak and almost
never true.

> +		if (x_intercept < limit) {
> +			x_intercept = limit;	/* auxiliary control line */
> +			setpoint += span;
> +			pos_ratio >>= 1;
> +		}
  And here you stretch the control area up to the global dirty limit. I
understand you maybe don't want to be really strict and cut the control area
at bdi_thresh, but your choice looks too benevolent - when you have
several active bdi's with different speeds this will effectively erase
the difference between them, won't it? E.g. with 10 bdi's (x_intercept -
bdi_dirty) / (x_intercept - setpoint) is going to be close to 1 unless
bdi_dirty really heavily exceeds bdi_thresh. So wouldn't it be better to
just make sure the control area is reasonably large (e.g. at least 16 MB) to
allow the BDI to ramp up its bdi_thresh, but not extend it up to the global
dirty limit?

> +	}
> +	pos_ratio *= x_intercept - bdi_dirty;
> +	do_div(pos_ratio, x_intercept - setpoint + 1);
> +
> +	return pos_ratio;
> +}
> +

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09  2:08     ` Vivek Goyal
@ 2011-08-16  8:59       ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:59 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

> > bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> > that the resulted task rate limit can drive the dirty pages back to the
> > global/bdi setpoints.
> > 
> 
> IMHO, "position_ratio" is not necessarily very intuitive. Can there be
> a better name? Based on your slides, it is a scaling factor applied to
> task rate limit depending on how well we are doing in terms of meeting
> our goal of dirty limit. Will "dirty_rate_scale_factor" or something like
> that make sense and be a little more intuitive?

Yeah, position_ratio is a scale factor applied to the dirty rate, and I
added a comment saying so. On the other hand, position_ratio does
reflect the underlying "position control of dirty pages" logic, so
over time it should become reasonably understandable from that angle too :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-10 21:40             ` Vivek Goyal
@ 2011-08-16  8:55               ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

Hi Vivek,

Sorry it caused you so much confusion. I hope Peter's 3rd order
polynomial abstraction in v9 can clarify the concepts a lot.

As for the old global control line

                       origin - dirty
           pos_ratio = --------------           (1)
                       origin - goal

where

        origin = 4 * thresh                     (2)

effectively decides the slope of the line. The use of @limit in code

        origin = max(4 * thresh, limit)         (3)

is merely a safeguard against the rare case that (2) might result in a
negative pos_ratio in (1).
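
Put as a user-space sketch, (1)-(3) look roughly like this. The function
name old_pos_ratio() and the fixed-point SCALE are assumptions for
illustration, and goal is simply passed in rather than derived from thresh.

```c
#include <assert.h>
#include <stdint.h>

#define SCALE 1024	/* assumed fixed point: 1024 means pos_ratio == 1.0 */

/* Hypothetical helper for the old linear global control line. */
static int64_t old_pos_ratio(int64_t thresh, int64_t limit,
			     int64_t goal, int64_t dirty)
{
	/* (3): origin never falls below limit, so dirty < limit stays positive */
	int64_t origin = 4 * thresh > limit ? 4 * thresh : limit;

	assert(origin > goal);	/* otherwise the line has no defined slope */

	/* (1): 1.0 at dirty == goal, falling linearly to 0 at dirty == origin */
	return (origin - dirty) * SCALE / (origin - goal);
}
```

For example, with thresh=1000, limit=5000 (thresh was just knocked down)
and dirty=4500, using origin = 4 * thresh alone would give a negative
result, while origin = max(4 * thresh, limit) keeps it small but positive.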

I have another patch to add a "brake" area immediately below @limit
that will scale pos_ratio down to 0. However that's no longer
necessary with the 3rd order polynomial solution. 
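
To show why no separate brake area is needed, here is a user-space sketch
of the 3rd order polynomial global control line. It follows the shape of
the posted bdi_position_ratio() code but is not a copy of it:
global_pos_ratio() is a hypothetical name, CALC_SHIFT = 10 is an assumed
value for RATELIMIT_CALC_SHIFT, and division is used instead of right
shifts to stay portable for negative values.

```c
#include <assert.h>
#include <stdint.h>

#define CALC_SHIFT 10	/* assumption: 1 << 10 == 1024 represents 1.0 */

/* Hypothetical helper: f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit, int64_t dirty)
{
	const int64_t one = (int64_t)1 << CALC_SHIFT;
	int64_t setpoint, x, pos_ratio;

	assert(freerun >= 0 && freerun <= limit);

	if (dirty >= limit)
		return 0;		/* hard limit: stop dirtying */

	setpoint = (freerun + limit) / 2;
	/* x is the normalized distance from the setpoint, in fixed point */
	x = (setpoint - dirty) * one / (limit - setpoint + 1);
	pos_ratio = x;
	pos_ratio = pos_ratio * x / one;	/* x^2 */
	pos_ratio = pos_ratio * x / one;	/* x^3 */
	pos_ratio += one;			/* 1 + x^3 */

	return pos_ratio;
}
```

With freerun=1000 and limit=3000 this yields exactly 1024 (1.0) at the
setpoint dirty=2000, close to 2048 (2.0) at dirty=freerun, and a value
close to 0 just below limit, so the curve already brakes hard near @limit
on its own.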

Note that @limit will normally be equal to @thresh except in the rare
case that @thresh is suddenly knocked down and @limit is taking time
to follow it.

Thanks,
Fengguang

> Hi Fengguang,
> 
> Ok, so just trying to understand this pos_ratio little better.
> 
> You have following basic formula.
> 
>                      origin - dirty
>          pos_ratio = --------------
>                      origin - goal
> 
> Terminology is very confusing and following is my understanding. 
> 
> - setpoint == goal
> 
>   setpoint is the point where we would like our number of dirty pages to
>   be and at this point pos_ratio = 1. For global dirty this number seems
>   to be (thresh - thresh / DIRTY_SCOPE) 
> 
> - thresh
>   dirty page threshold calculated from dirty_ratio (Certain percentage of
>   total memory).
> 
> - Origin (seems to be equivalent of limit)
> 
>   This seems to be the reference point/limit we don't want to cross and
>   distance from this limit basically decides the pos_ratio. Closer we
>   are to limit, lower the pos_ratio and further we are higher the
>   pos_ratio.
> 
> So threshold is just a number which helps us determine goal and limit.
> 
> goal = thresh - thresh / DIRTY_SCOPE
> limit = 4*thresh
> 
> So goal is where we want to be and we start throttling the task more as
> we move away goal and approach limit. We keep the limit high enough
> so that (origin-dirty) does not become negative entity.
> 
> So we do expect to cross "thresh" otherwise thresh itself could have
> served as limit?
> 
> If my understanding is right, then can we get rid of terms "setpoint" and
> "origin". Would it be easier to understand the things if we just talk
> in terms of "goal" and "limit" and how these are derived from "thresh".
> 
> 	thresh == soft limit
> 	limit == 4*thresh (hard limit)
> 	goal = thresh - thresh / DIRTY_SCOPE (where we want system to
> 						be in steady state).
>                      limit - dirty
>          pos_ratio = --------------
>                      limit - goal
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-16  8:55               ` Wu Fengguang
  0 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:55 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

Hi Vivek,

Sorry it made such a big confusion to you. I hope Peter's 3rd order
polynomial abstraction in v9 can clarify the concepts a lot.

As for the old global control line

                       origin - dirty
           pos_ratio = --------------           (1)
                       origin - goal

where

        origin = 4 * thresh                     (2)

effectively decides the slope of the line. The use of @limit in code

        origin = max(4 * thresh, limit)         (3)

is merely to safeguard the rare case that (2) might result in negative
pos_ratio in (1).

I have another patch to add a "brake" area immediately below @limit
that will scale pos_ratio down to 0. However that's no longer
necessary with the 3rd order polynomial solution. 

Note that @limit will normally be equal to @thresh except in the rare
case that @thresh is suddenly knocked down and @limit is taking time
to follow it.

Thanks,
Fengguang

> Hi Fengguang,
> 
> Ok, so just trying to understand this pos_ratio a little better.
> 
> You have the following basic formula.
> 
>                      origin - dirty
>          pos_ratio = --------------
>                      origin - goal
> 
> The terminology is very confusing; the following is my understanding.
> 
> - setpoint == goal
> 
>   setpoint is the point where we would like our number of dirty pages to
>   be and at this point pos_ratio = 1. For global dirty this number seems
>   to be (thresh - thresh / DIRTY_SCOPE) 
> 
> - thresh
>   dirty page threshold calculated from dirty_ratio (Certain percentage of
>   total memory).
> 
> - Origin (seems to be equivalent of limit)
> 
>   This seems to be the reference point/limit we don't want to cross, and
>   the distance from this limit basically decides the pos_ratio. The closer
>   we are to the limit, the lower the pos_ratio; the further away we are,
>   the higher the pos_ratio.
> 
> So threshold is just a number which helps us determine goal and limit.
> 
> goal = thresh - thresh / DIRTY_SCOPE
> limit = 4*thresh
> 
> So goal is where we want to be and we start throttling the task more as
> we move away from the goal and approach the limit. We keep the limit high
> enough so that (origin - dirty) does not become negative.
> 
> So we do expect to cross "thresh" otherwise thresh itself could have
> served as limit?
> 
> If my understanding is right, then can we get rid of terms "setpoint" and
> "origin". Would it be easier to understand the things if we just talk
> in terms of "goal" and "limit" and how these are derived from "thresh".
> 
> 	thresh == soft limit
> 	limit == 4*thresh (hard limit)
> 	goal = thresh - thresh / DIRTY_SCOPE (where we want system to
> 						be in steady state).
>                      limit - dirty
>          pos_ratio = --------------
>                      limit - goal
> 
> Thanks
> Vivek

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11 11:14                   ` Jan Kara
@ 2011-08-16  8:35                     ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-16  8:35 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 07:14:23PM +0800, Jan Kara wrote:
> On Thu 11-08-11 10:29:52, Wu Fengguang wrote:
> > On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> > > On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > > > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > > > >                     origin - dirty
> > > > > >         pos_ratio = --------------
> > > > > >                     origin - goal 
> > > > > 
> > > > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > > > pos_ratio == 1.0:
> > > > > 
> > > > > OK, so basically you want a linear function for which:
> > > > > 
> > > > > f(goal) = 1 and has a root somewhere > goal.
> > > > > 
> > > > > (that one line is much more informative than all your graphs put
> > > > > together, one can start from there and derive your function)
> > > > > 
> > > > > That does indeed get you the above function, now what does it mean? 
> > > > 
> > > > So going by:
> > > > 
> > > >                                          write_bw
> > > >   ref_bw = dirty_ratelimit * pos_ratio * --------
> > > >                                          dirty_bw
> > > 
> > >   Actually, thinking about these formulas, why do we even bother with
> > > computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> > > Couldn't we just have a feedback loop (probably similar to the one
> > > computing pos_ratio) which will maintain single value - ratelimit? When we
> > > are getting close to dirty limit, we will scale ratelimit down, when we
> > > will be getting significantly below dirty limit, we will scale the
> > > ratelimit up.  Because looking at the formulas it seems to me that the net
> > > effect is the same - pos_ratio basically overrules everything... 
> > 
> > Good question. That is actually one of the early approaches I tried.
> > It worked to some degree; however, the resulting ratelimit was not only
> > slow to respond but also oscillated all the time.
>   Yes, I think I vaguely remember that.
> 
> > This is due to the imperfections
> > 
> > 1) pos_ratio at best only provides a "direction" for adjusting the
> >    ratelimit. There are only vague clues that if pos_ratio is small,
> >    the errors in ratelimit should be small.
> > 
> > 2) Due to time-lag, the assumptions in (1) about "direction" and
> >    "error size" can be wrong. The ratelimit may already be
> >    over-adjusted when the dirty pages take time to approach the
> >    setpoint. The larger memory, the more time lag, the easier to
> >    overshoot and oscillate.
> > 
> > 3) dirty pages are constantly fluctuating around the setpoint,
> >    so is pos_ratio.
> > 
> > With (1) and (2), it's a control system very susceptible to disturbances.
> > With (3) we get constant disturbances. Well, I had a very hard time and
> > played dirty tricks (which you may never want to know ;-) trying to
> > trade off response time against stability.
>   Yes, I can see especially 2) is a problem. But I don't understand why
> your current formula would be that much different. As Peter decoded from
> your code, your current formula is:
>                                         write_bw
>  ref_bw = dirty_ratelimit * pos_ratio * --------
>                                         dirty_bw
> 
> while previously it was essentially:
>  ref_bw = dirty_ratelimit * pos_ratio

Sorry, which code are you referring to? Does the changelog in the
newly posted patchset make the ref_bw calculation and dirty_ratelimit
updating clearer?

> So what is so magical about computing write_bw and dirty_bw separately? Is
> it because previously you did not use the derivative of the distance from
> the goal for updating pos_ratio? Because in your current formula
> write_bw/dirty_bw is a derivative of position...

dirty_bw is the main feedback. If we are throttling too much, the
resulting dirty_bw will be lower than write_bw. Thus 

                                      write_bw
   ref_bw = ratelimit_in_past_200ms * --------
                                      dirty_bw

will give us a higher ref_bw than ratelimit_in_past_200ms. For a pure
dd workload, the ref_bw computed by the above formula is exactly the
balanced rate (ignoring trivial errors).
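
A rough sketch of that feedback step, for illustration only. The function
name and the zero-divisor fallback are assumptions of this sketch, and the
worked numbers assume N dd tasks that each dirty pages at exactly the rate
limit, so that dirty_bw = N * ratelimit:

```c
#include <stdint.h>

/*
 * Illustrative sketch:
 *
 *	ref_bw = ratelimit_in_past_200ms * write_bw / dirty_bw
 *
 * All three inputs are rates in the same unit (e.g. pages per second).
 */
static uint64_t balanced_rate(uint64_t ratelimit, uint64_t write_bw,
			      uint64_t dirty_bw)
{
	/* nothing was dirtied in the period: keep the old rate (assumption) */
	if (dirty_bw == 0)
		return ratelimit;

	return ratelimit * write_bw / dirty_bw;
}
```

Under that assumption, 4 tasks throttled at 50 dirty at 200 in total; with
write_bw = 100 the formula gives 50 * 100 / 200 = 25, which is write_bw / 4.
Starting instead from a too-low limit of 10 (dirty_bw = 40) gives
10 * 100 / 40 = 25 as well, so one step lands on the same rate.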

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* [PATCH 2/5] writeback: dirty position control
  2011-08-16  2:20 [PATCH 0/5] IO-less dirty throttling v9 Wu Fengguang
  2011-08-16  2:20   ` Wu Fengguang
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 13157 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - setpoint)

where k is the negative slope.

Targeting a 12.5% fluctuation range in pos_ratio while dirty pages
fluctuate in the range [setpoint - write_bw/2, setpoint + write_bw/2],
we get the slope

	k = - 1 / (8 * write_bw)

Let pos_ratio(x_intercept) = 0, we get the parameter used in code:

	x_intercept = setpoint + 8 * write_bw
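
As a sanity check of the numbers above, here is a minimal stand-alone
sketch of the single-bdi control line. The function name, the CALC_SHIFT
scale and the write_bw == 0 guard are assumptions of the sketch; the patch
itself folds this line into bdi_position_ratio() and divides by
(x_intercept - setpoint + 1).

```c
#include <stdint.h>

#define CALC_SHIFT 10	/* assumed fixed-point scale: 1.0 == 1 << 10 */

/*
 * Illustrative sketch of
 *
 *	pos_ratio = 1.0 + k * (dirty - setpoint),  k = -1 / (8 * write_bw)
 *
 * rewritten with x_intercept = setpoint + 8 * write_bw as
 *
 *	pos_ratio = (x_intercept - dirty) / (x_intercept - setpoint)
 */
static uint64_t bdi_line_pos_ratio(uint64_t setpoint, uint64_t write_bw,
				   uint64_t bdi_dirty)
{
	uint64_t x_intercept = setpoint + 8 * write_bw;

	/* no bandwidth estimate yet, or at/past the root of the line */
	if (write_bw == 0 || bdi_dirty >= x_intercept)
		return 0;

	return ((x_intercept - bdi_dirty) << CALC_SHIFT) / (8 * write_bw);
}
```

With setpoint = 10000 and write_bw = 1000, the two ends of the fluctuation
range give 1088 at dirty = 9500 and 960 at dirty = 10500. The difference,
128 out of 1024, is the 12.5% range targeted above.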

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to setpoint reasonably fast.

Unfortunately, in JBOD setups the fluctuation range of the bdi threshold
is related to memory size due to interference between disks.  In
this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.
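
The weighting can be read off the span computation in the patch below
(x_intercept = setpoint + 2 * span). Pulled out into a stand-alone helper
for illustration, with a hypothetical name and plain 64-bit types assumed:

```c
#include <stdint.h>

/*
 * Illustrative sketch of the weighted sum:
 *
 *	span ~= bdi_thresh     * (thresh - bdi_thresh) / thresh
 *	      + 4 * write_bw   *  bdi_thresh           / thresh
 *
 * The single bdi case (bdi_thresh ~= thresh) leaves span ~= 4 * write_bw;
 * a small bdi in a JBOD (bdi_thresh << thresh) leaves span ~= bdi_thresh.
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh,
			 uint64_t write_bw)
{
	if (bdi_thresh > thresh)
		bdi_thresh = thresh;

	/* "+ 1" as in the patch: avoids dividing by zero */
	return (bdi_thresh * (thresh - bdi_thresh) +
		4 * write_bw * bdi_thresh) / (thresh + 1);
}
```

For thresh = 1000 and write_bw = 100, a lone bdi (bdi_thresh = 1000) gets a
span of 399, close to 4 * write_bw, while a bdi holding a tenth of the
threshold (bdi_thresh = 100) gets 129, that is 0.9 * 100 + 0.1 * 400 rounded
down by the "+ 1" divisor.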

peter: use 3rd order polynomial for the global control line
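
For reference, the polynomial is f(dirty) = 1 + ((setpoint - dirty) /
(limit - setpoint))^3 with setpoint = (freerun + limit) / 2. A stand-alone
fixed-point sketch follows; the function name, the signed 64-bit types and
the explicit guard are assumptions of the sketch, whereas the patch adds 1
to the divisor and uses shifts.

```c
#include <stdint.h>

#define CALC_SHIFT 10			/* assumed fixed-point scale */
#define CALC_ONE   (1LL << CALC_SHIFT)	/* 1.0 */

/*
 * Illustrative sketch of the global control line.  Callers are assumed
 * to pass non-negative page counts.  Returns f(dirty) scaled by CALC_ONE.
 */
static int64_t global_pos_ratio(int64_t freerun, int64_t limit,
				int64_t dirty)
{
	int64_t setpoint = (freerun + limit) / 2;
	int64_t x, ratio;

	/* hard limit reached, or an empty control scope */
	if (dirty >= limit || limit <= setpoint)
		return 0;

	/* x = (setpoint - dirty) / (limit - setpoint), may be negative */
	x = (setpoint - dirty) * CALC_ONE / (limit - setpoint);

	ratio = x * x / CALC_ONE;	/* x^2 */
	ratio = ratio * x / CALC_ONE;	/* x^3 */
	return ratio + CALC_ONE;	/* 1 + x^3 */
}
```

With freerun = 1000 and limit = 3000 this gives 2048 (2.0) at the freerun
ceiling, 1024 (1.0) at the setpoint of 2000, and 1152 and 896 at 1500 and
2500, so the curve is flat near the setpoint and steep towards both ends.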

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  196 +++++++++++++++++++++++++++++++++++-
 3 files changed, 193 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-14 18:03:49.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-14 21:33:39.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,180 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 setpoint                     x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* the target balance point */
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                         setpoint - dirty 3
+	 *        f(dirty) := 1 + (----------------)
+	 *                         limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx       < 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
+	 * for various filesystems, where (2) can yield in a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
+	setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 */
+	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
+		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
+		       thresh + 1);
+	x_intercept = setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +775,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +812,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +821,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +863,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +908,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-14 18:03:50.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-14 18:03:50.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,



^ permalink raw reply	[flat|nested] 169+ messages in thread

* [PATCH 2/5] writeback: dirty position control
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 13460 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within 1-second worth of data, the bdi control line's slope is
selected to be a linear function of bdi write bandwidth, so that it can
adapt to slow/fast storage devices well.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - setpoint)

where k is the negative slope.

Targeting a 12.5% fluctuation range in pos_ratio while dirty pages
fluctuate in the range [setpoint - write_bw/2, setpoint + write_bw/2],
we get the slope

	k = - 1 / (8 * write_bw)

Let pos_ratio(x_intercept) = 0, we get the parameter used in code:

	x_intercept = setpoint + 8 * write_bw

The global/bdi slopes are nicely complementing each other when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to setpoint reasonably fast.

Unfortunately, in JBOD setups the fluctuation range of the bdi threshold
is related to memory size due to interference between disks.  In
this case, the bdi slope will be a weighted sum of write_bw and bdi_thresh.

peter: use 3rd order polynomial for the global control line

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  196 +++++++++++++++++++++++++++++++++++-
 3 files changed, 193 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-14 18:03:49.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-14 21:33:39.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,180 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 setpoint                     x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* the target balance point */
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                         setpoint - dirty 3
+	 *        f(dirty) := 1 + (----------------)
+	 *                         limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx       < 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
+	 * for various filesystems, where (2) can yield in a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
+	setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 */
+	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
+		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
+		       thresh + 1);
+	x_intercept = setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +775,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +812,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +821,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +863,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +908,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-14 18:03:50.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-14 18:03:50.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,



^ permalink raw reply	[flat|nested] 169+ messages in thread

* [PATCH 2/5] writeback: dirty position control
@ 2011-08-16  2:20   ` Wu Fengguang
  0 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-16  2:20 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Peter Zijlstra, Wu Fengguang, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 13460 bytes --]

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

The slope of the bdi control line should be

1) large enough to pull the dirty pages to setpoint reasonably fast

2) small enough to avoid big fluctuations in the resulting pos_ratio and
   hence task ratelimit

Since the fluctuation range of the bdi dirty pages is typically observed
to be within one second's worth of data, the bdi control line's slope is
chosen as a linear function of the bdi write bandwidth, so that it
adapts well to both slow and fast storage devices.

Assume the bdi control line

	pos_ratio = 1.0 + k * (dirty - setpoint)

where k is the negative slope.

Targeting a 12.5% fluctuation range in pos_ratio while the dirty pages
fluctuate within [setpoint - write_bw/2, setpoint + write_bw/2], we get
the slope

	k = - 1 / (8 * write_bw)

Setting pos_ratio(x_intercept) = 0 gives the parameter used in the code:

	x_intercept = setpoint + 8 * write_bw
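
The derivation above can be checked numerically. The following is only a
minimal sketch, not code from the patch: bdi_line_pos_ratio() is a
hypothetical helper, the fixed-point scale is borrowed from
RATELIMIT_CALC_SHIFT, and returning 1.0 when write_bw is unknown is an
arbitrary choice made for the example.

```c
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT 10	/* fixed-point shift, as in the patch */

/*
 * Hypothetical helper (not part of the patch): evaluate the main bdi
 * control line
 *
 *     pos_ratio = 1.0 + k * (bdi_dirty - setpoint),  k = -1 / (8 * write_bw)
 *
 * in its equivalent x_intercept form
 *
 *     pos_ratio = (x_intercept - bdi_dirty) / (x_intercept - setpoint)
 *
 * The result is fixed point: 1 << RATELIMIT_CALC_SHIFT means 1.0.
 * Page counts are in pages, write_bw in pages per second.
 */
int64_t bdi_line_pos_ratio(int64_t bdi_dirty, int64_t setpoint,
			   int64_t write_bw)
{
	int64_t x_intercept;

	if (write_bw <= 0)	/* assumption: no slope known, stay neutral */
		return (int64_t)1 << RATELIMIT_CALC_SHIFT;

	x_intercept = setpoint + 8 * write_bw;
	if (bdi_dirty >= x_intercept)
		return 0;	/* at or beyond the root of the line */

	return ((x_intercept - bdi_dirty) << RATELIMIT_CALC_SHIFT) /
	       (x_intercept - setpoint);
}
```

With setpoint = 100000 pages and write_bw = 25600 pages/s, moving
bdi_dirty from setpoint - write_bw/2 to setpoint + write_bw/2 moves the
result from 1088 to 960, i.e. 128/1024 = 12.5% of full scale.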

The global/bdi slopes complement each other nicely when the
system has only one major bdi (indicated by bdi_thresh ~= thresh):

1) slope of global control line    => scaling to the control scope size
2) slope of main bdi control line  => scaling to the write bandwidth

so that

- in memory tight systems, (1) becomes strong enough to squeeze dirty
  pages inside the control scope

- in large memory systems where the "gravity" of (1) for pulling the
  dirty pages to setpoint is too weak, (2) can back (1) up and drive
  dirty pages to setpoint reasonably fast.

Unfortunately, in JBOD setups the fluctuation range of the bdi threshold
is related to memory size due to interference between the disks.  In
this case, the bdi slope is a weighted sum of write_bw and bdi_thresh.
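
The weighted sum can be illustrated with the span formula used in
bdi_position_ratio(). This is a sketch under assumptions: bdi_span() is
a hypothetical standalone function, plain 64-bit arithmetic replaces the
kernel's div_u64(), and the products are assumed not to overflow 64 bits.

```c
#include <stdint.h>

/*
 * Hypothetical standalone version of the span computation:
 *
 *     span = bdi_thresh * (1 - w) + 4 * write_bw * w,  w = bdi_thresh / thresh
 *
 * so a single dominant bdi (w ~= 1) gets span ~= 4 * write_bw, while a
 * small member of a JBOD (w ~= 0) gets span ~= bdi_thresh.
 * The "+ 1" in the divisor avoids a division by zero, as in the patch.
 */
uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh, uint64_t write_bw)
{
	if (bdi_thresh > thresh)
		bdi_thresh = thresh;

	return (bdi_thresh * (thresh - bdi_thresh) +
		4 * write_bw * bdi_thresh) / (thresh + 1);
}
```

Since x_intercept = setpoint + 2 * span, the single bdi case reduces to
the x_intercept = setpoint + 8 * write_bw derived earlier.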

peter: use 3rd order polynomial for the global control line
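
For reference, the 3rd order polynomial documented in the patch comment,
f(dirty) = 1 + ((setpoint - dirty) / (limit - setpoint))^3, can be
sketched in user space as follows. global_pos_ratio() and FIXED_ONE are
names made up for this example, freerun <= limit is assumed, and
multiplication and division replace the signed shifts of the patch, so
negative intermediate values round toward zero rather than down.

```c
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT 10	/* fixed-point shift, as in the patch */
#define FIXED_ONE ((int64_t)1 << RATELIMIT_CALC_SHIFT)

/*
 * Hypothetical user-space version of the global control line.
 * Returns pos_ratio in fixed point (FIXED_ONE means 1.0):
 * about 2.0 at freerun, 1.0 at setpoint, close to 0 just below limit.
 */
int64_t global_pos_ratio(int64_t dirty, int64_t freerun, int64_t limit)
{
	int64_t setpoint = (freerun + limit) / 2;
	int64_t x, pos_ratio;

	if (dirty >= limit)
		return 0;	/* hard limit reached */

	/* x = (setpoint - dirty) / (limit - setpoint), in fixed point */
	x = (setpoint - dirty) * FIXED_ONE / (limit - setpoint + 1);

	pos_ratio = x;					/* x   */
	pos_ratio = pos_ratio * x / FIXED_ONE;		/* x^2 */
	pos_ratio = pos_ratio * x / FIXED_ONE;		/* x^3 */
	return pos_ratio + FIXED_ONE;			/* 1 + x^3 */
}
```

The cube keeps the curve flat near the setpoint (small oscillations) and
steep near freerun and limit (fast response to large errors).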

CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |    2 
 include/linux/writeback.h |    1 
 mm/page-writeback.c       |  196 +++++++++++++++++++++++++++++++++++-
 3 files changed, 193 insertions(+), 6 deletions(-)

--- linux-next.orig/mm/page-writeback.c	2011-08-14 18:03:49.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-14 21:33:39.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define RATELIMIT_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -411,6 +413,12 @@ unsigned long determine_dirtyable_memory
 	return x + 1;	/* Ensure that we never return 0 */
 }
 
+static unsigned long dirty_freerun_ceiling(unsigned long thresh,
+					   unsigned long bg_thresh)
+{
+	return (thresh + bg_thresh) / 2;
+}
+
 static unsigned long hard_dirty_limit(unsigned long thresh)
 {
 	return max(thresh, global_dirty_limit);
@@ -495,6 +503,180 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ * We want the dirty pages be balanced around the global/bdi setpoints.
+ * When the number of dirty pages is higher/lower than the setpoint, the
+ * dirty position control ratio (and hence task dirty ratelimit) will be
+ * decreased/increased to bring the dirty pages back to the setpoint.
+ *
+ *     pos_ratio = 1 << RATELIMIT_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ *     task_ratelimit = balanced_rate * pos_ratio >> RATELIMIT_CALC_SHIFT
+ *
+ * (o) global control line
+ *
+ *     ^ pos_ratio
+ *     |
+ *     |            |<===== global dirty control scope ======>|
+ * 2.0 .............*
+ *     |            .*
+ *     |            . *
+ *     |            .   *
+ *     |            .     *
+ *     |            .        *
+ *     |            .            *
+ * 1.0 ................................*
+ *     |            .                  .     *
+ *     |            .                  .          *
+ *     |            .                  .              *
+ *     |            .                  .                 *
+ *     |            .                  .                    *
+ *   0 +------------.------------------.----------------------*------------->
+ *           freerun^          setpoint^                 limit^   dirty pages
+ *
+ * (o) bdi control lines
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * The below figure illustrates the main bdi control line with an auxiliary
+ * line extending it to @limit.
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, rate scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, rate scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 setpoint                     x_intercept           limit
+ *
+ * The auxiliary control line allows smoothly throttling bdi_dirty down to
+ * normal if it starts high in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to many times higher than bdi setpoint.
+ * - the bdi dirty thresh drops quickly due to change of JBOD workload
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long bg_thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long x_intercept;
+	unsigned long setpoint;		/* the target balance point */
+	unsigned long span;
+	long long pos_ratio;		/* for scaling up/down the rate limit */
+	long x;
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 *
+	 *                         setpoint - dirty 3
+	 *        f(dirty) := 1 + (----------------)
+	 *                         limit - setpoint
+	 *
+	 * it's a 3rd order polynomial that subjects to
+	 *
+	 * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
+	 * (2) f(setpoint) = 1.0 => the balance point
+	 * (3) f(limit)    = 0   => the hard limit
+	 * (4) df/dx       < 0	 => negative feedback control
+	 * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+	 *     => fast response on large errors; small oscillation near setpoint
+	 */
+	setpoint = (freerun + limit) / 2;
+	x = div_s64((setpoint - dirty) << RATELIMIT_CALC_SHIFT,
+		    limit - setpoint + 1);
+	pos_ratio = x;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+	/*
+	 * bdi setpoint
+	 *
+	 *        f(dirty) := 1.0 + k * (dirty - setpoint)
+	 *
+	 * The main bdi control line is a linear function that subjects to
+	 *
+	 * (1) f(setpoint) = 1.0
+	 * (2) k = - 1 / (8 * write_bw)  (in single bdi case)
+	 *     or equally: x_intercept = setpoint + 8 * write_bw
+	 *
+	 * For single bdi case, the dirty pages are observed to fluctuate
+	 * regularly within range
+	 *        [setpoint - write_bw/2, setpoint + write_bw/2]
+	 * for various filesystems, where (2) can yield in a reasonable 12.5%
+	 * fluctuation range for pos_ratio.
+	 *
+	 * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its
+	 * own size, so move the slope over accordingly.
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	/*
+	 * scale global setpoint to bdi's:  setpoint *= bdi_thresh / thresh
+	 */
+	x = div_u64((u64)bdi_thresh << 16, thresh | 1);
+	setpoint = setpoint * (u64)x >> 16;
+	/*
+	 * Use span=(4*write_bw) in single bdi case as indicated by
+	 * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case.
+	 */
+	span = div_u64((u64)bdi_thresh * (thresh - bdi_thresh) +
+		       (u64)(4 * bdi->avg_write_bandwidth) * bdi_thresh,
+		       thresh + 1);
+	x_intercept = setpoint + 2 * span;
+
+	if (unlikely(bdi_dirty > setpoint + span)) {
+		if (unlikely(bdi_dirty > limit))
+			return 0;
+		if (x_intercept < limit) {
+			x_intercept = limit;	/* auxiliary control line */
+			setpoint += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= x_intercept - bdi_dirty;
+	do_div(pos_ratio, x_intercept - setpoint + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
@@ -593,6 +775,7 @@ static void global_update_bandwidth(unsi
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,
@@ -629,6 +812,7 @@ snapshot:
 
 static void bdi_update_bandwidth(struct backing_dev_info *bdi,
 				 unsigned long thresh,
+				 unsigned long bg_thresh,
 				 unsigned long dirty,
 				 unsigned long bdi_thresh,
 				 unsigned long bdi_dirty,
@@ -637,8 +821,8 @@ static void bdi_update_bandwidth(struct 
 	if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
 		return;
 	spin_lock(&bdi->wb.list_lock);
-	__bdi_update_bandwidth(bdi, thresh, dirty, bdi_thresh, bdi_dirty,
-			       start_time);
+	__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
+			       bdi_thresh, bdi_dirty, start_time);
 	spin_unlock(&bdi->wb.list_lock);
 }
 
@@ -679,7 +863,8 @@ static void balance_dirty_pages(struct a
 		 * catch-up. This avoids (excessively) small writeouts
 		 * when the bdi limits are ramping up.
 		 */
-		if (nr_dirty <= (background_thresh + dirty_thresh) / 2)
+		if (nr_dirty <= dirty_freerun_ceiling(dirty_thresh,
+						      background_thresh))
 			break;
 
 		bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh);
@@ -723,8 +908,9 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		bdi_update_bandwidth(bdi, dirty_thresh, nr_dirty,
-				     bdi_thresh, bdi_dirty, start_time);
+		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
+				     nr_dirty, bdi_thresh, bdi_dirty,
+				     start_time);
 
 		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
 		 * Unstable writes are a feature of certain networked
--- linux-next.orig/fs/fs-writeback.c	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-14 18:03:50.000000000 +0800
@@ -670,7 +670,7 @@ static inline bool over_bground_thresh(v
 static void wb_update_bandwidth(struct bdi_writeback *wb,
 				unsigned long start_time)
 {
-	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, start_time);
+	__bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
 }
 
 /*
--- linux-next.orig/include/linux/writeback.h	2011-08-14 18:03:45.000000000 +0800
+++ linux-next/include/linux/writeback.h	2011-08-14 18:03:50.000000000 +0800
@@ -154,6 +154,7 @@ unsigned long bdi_dirty_limit(struct bac
 
 void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long thresh,
+			    unsigned long bg_thresh,
 			    unsigned long dirty,
 			    unsigned long bdi_thresh,
 			    unsigned long bdi_dirty,



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 13:04             ` Peter Zijlstra
@ 2011-08-12 14:20               ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-12 14:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

Peter,

Sorry for the delay..

On Fri, Aug 12, 2011 at 09:04:19PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:

To start with,

                                                write_bw
        ref_bw = task_ratelimit_in_past_200ms * --------
                                                dirty_bw

where
        task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio
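
The two relations above can be written down as a small sketch. It is an
illustration only: estimate_ref_bw() is a hypothetical function,
pos_ratio is assumed to be fixed point with the patch's
RATELIMIT_CALC_SHIFT, and the guard against dirty_bw == 0 is an
assumption of the example.

```c
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT 10	/* fixed-point shift, as in the patch */

/*
 * Hypothetical helper: estimate the balanced rate from what tasks were
 * actually allowed to dirty in the last period and the observed
 * imbalance between write-out and dirtying bandwidth.
 *
 *     task_ratelimit ~= dirty_ratelimit * pos_ratio
 *     ref_bw          = task_ratelimit * write_bw / dirty_bw
 */
uint64_t estimate_ref_bw(uint64_t dirty_ratelimit, uint64_t pos_ratio,
			 uint64_t write_bw, uint64_t dirty_bw)
{
	uint64_t task_ratelimit =
		(dirty_ratelimit * pos_ratio) >> RATELIMIT_CALC_SHIFT;

	if (dirty_bw == 0)	/* assumption: avoid dividing by zero */
		dirty_bw = 1;

	return task_ratelimit * write_bw / dirty_bw;
}
```

For example, if tasks were limited to 100 pages/s each and pages were
dirtied twice as fast as they were written (write_bw = 50,
dirty_bw = 100), the estimate halves the rate to 50 pages/s.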

> > Now all of the above would seem to suggest:
> > 
> >   dirty_ratelimit := ref_bw

Right, ideally ref_bw is the balanced dirty ratelimit. I actually
started with exactly the above equation when I got choked by pure
pos_bw based feedback control (as mentioned in the reply to Jan's
email) and introduced the ref_bw estimation as the way out.

But ref_bw has some imperfections too, which make it unsuitable for
direct use:

1) large fluctuations

The dirty_bw used for computing ref_bw is averaged over only the past
200ms (very short compared to the 3s estimation period of write_bw),
which makes the distribution of ref_bw rather dispersed.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/dirty-throttling-v8/8G/ext4-10dd-4k-32p-6802M-20:10-3.0.0-next-20110802+-2011-08-06.16:48/balance_dirty_pages-pages.png

Take a look at the blue [*] points in the above graph. I find it pretty
hard to average out the singular points by increasing the estimation
period. Considering that averaging would also introduce very
undesirable time lags, I gave up on it entirely. (btw, the write_bw
averaging time lag is much more acceptable because its impact is
one-way and therefore won't lead to oscillations.)

The one practical way is filtering: the largest singular ref_bw
points can be filtered out effectively by remembering some prev_ref_bw
and prev_prev_ref_bw. However, that cannot do away with all of them, and
the remaining majority of ref_bw points still dance randomly around the
ideal balanced rate.

2) due to truncates and fs redirties, (write_bw <=> dirty_bw) becomes
an unbalanced match, which leads to large systematic errors in ref_bw.
The truncates, being possibly bumpy in nature, can hardly be compensated
for smoothly. So let's face it. When some over-estimated ref_bw brings
->dirty_ratelimit too high, the dirty pages will rise above the
setpoint and pos_bw will in turn become lower than ->dirty_ratelimit.
So if we consider both ref_bw and pos_bw and update ->dirty_ratelimit
only when they are on the same side of ->dirty_ratelimit, the
systematic errors in ref_bw won't be able to bring ->dirty_ratelimit
too far away.

The ref_bw estimation is also inaccurate near the max pause and
free run areas.

3) since we ultimately want to

- keep the dirty pages around the setpoint for as long as possible
- keep the fluctuations of task ratelimit as small as possible

the update policy used for (2) also serves the above goals nicely:
if for some reason the dirty pages are high (pos_bw < dirty_ratelimit),
and dirty_ratelimit is low (dirty_ratelimit < ref_bw), there is no
point in bringing up dirty_ratelimit in a hurry and hurting both of
the above goals.

> > However for that you use:
> > 
> >   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> >         dirty_ratelimit = max(ref_bw, pos_bw);
> > 
> >   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> >         dirty_ratelimit = min(ref_bw, pos_bw);

The above are merely constraints on the dirty_ratelimit update.
They serve to

1) stop adjusting the rate when it's against the position control
   target (the adjusted rate will slow down the progress of dirty
   pages going back to setpoint).

2) limit the step size. pos_bw changes its value step by step,
   leaving a consistent trace compared to the randomly jumping
   ref_bw. pos_bw also has smaller errors in the stable state and
   normally has larger errors when there are big errors in rate. So
   it's a pretty good limiting factor for the step size of
   dirty_ratelimit.
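
The two quoted conditions and the constraints they impose can be put
into one function. This is a minimal sketch, assuming plain unsigned
long rates; update_dirty_ratelimit(), min_ul() and max_ul() are
hypothetical names for the example and not the kernel's code.

```c
static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/*
 * Hypothetical sketch of the constrained update: move dirty_ratelimit
 * only when the rate estimate (ref_bw) and the position feedback
 * (pos_bw) agree on the direction, and step no further than the
 * nearer of the two.
 */
unsigned long update_dirty_ratelimit(unsigned long dirty_ratelimit,
				     unsigned long ref_bw,
				     unsigned long pos_bw)
{
	/* both say "too fast": slow down, but only to the nearer target */
	if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
		return max_ul(ref_bw, pos_bw);

	/* both say "too slow": speed up, but only to the nearer target */
	if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
		return min_ul(ref_bw, pos_bw);

	/* they disagree: leave the rate alone */
	return dirty_ratelimit;
}
```

With dirty_ratelimit = 100, the pair (ref_bw = 60, pos_bw = 80) lowers
the rate to 80, while (ref_bw = 150, pos_bw = 80) leaves it at 100
because raising it would work against the position control.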

> > You have:
> > 
> >   pos_bw = dirty_ratelimit * pos_ratio
> > 
> > Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> > why are you ignoring the shift in output vs input rate there? 

Again, you need to understand pos_bw the other way.  Only (pos_bw -
dirty_ratelimit) matters here, which is exactly the position error.

> Could you elaborate on this primary feedback loop? Its the one part I
> don't feel I actually understand well.

Hope the above elaboration helps :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09 17:20             ` Peter Zijlstra
@ 2011-08-12 13:19               ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-12 13:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Wed, Aug 10, 2011 at 01:20:27AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > >                     origin - dirty
> > >         pos_ratio = --------------
> > >                     origin - goal 
> > 
> > > which comes from the below [*] control line, so that when (dirty == goal),
> > > pos_ratio == 1.0:
> > 
> > OK, so basically you want a linear function for which:
> > 
> > f(goal) = 1 and has a root somewhere > goal.
> > 
> > (that one line is much more informative than all your graphs put
> > together, one can start from there and derive your function)
> > 
> > That does indeed get you the above function, now what does it mean? 
> 
> So going by:
> 
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw
> 
> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint.

Yes.

> So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.

However, the above function is better interpreted as

                                            write_bw
    ref_bw = task_ratelimit_in_past_200ms * --------
                                            dirty_bw

where
        task_ratelimit_in_past_200ms ~= dirty_ratelimit * pos_ratio

It would be highly confusing to look for a direct "logical"
relationship between ref_bw and pos_ratio in the above equation.
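
As a rough illustration of this reading, the sketch below first forms
the rate the tasks were actually throttled at, then rescales it by the
output/input ratio. Everything here is an assumption made for the
example: the name ref_bw_sketch(), the 10-bit fixed-point
representation of pos_ratio, and the guard against a zero dirty_bw are
not from the patch.

```c
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10	/* assumed: 1.0 is stored as 1024 */

/*
 * pos_ratio is a fixed-point factor; write_bw and dirty_bw are in the
 * same (arbitrary) units, e.g. pages per second.
 */
unsigned long ref_bw_sketch(unsigned long dirty_ratelimit,
			    unsigned long pos_ratio,
			    unsigned long write_bw,
			    unsigned long dirty_bw)
{
	/* task_ratelimit ~= dirty_ratelimit * pos_ratio */
	uint64_t task_ratelimit =
		((uint64_t)dirty_ratelimit * pos_ratio) >> RATELIMIT_CALC_SHIFT;

	if (dirty_bw == 0)		/* nothing measured yet: no rescale */
		return (unsigned long)task_ratelimit;

	/* ref_bw = task_ratelimit * write_bw / dirty_bw */
	return (unsigned long)(task_ratelimit * write_bw / dirty_bw);
}
```

With pos_ratio at 1.0 and tasks dirtying twice as fast as the disk
writes (write_bw = 50, dirty_bw = 100), a dirty_ratelimit of 100 gives
a ref_bw of 50.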

> From that we derive the condition that: 
> 
>   pos_ratio(setpoint) := 1

Right.

> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> 
>   pos_ratio(limit) := 0
> 
> Using these two conditions we can solve the equations and get your:
> 
>                         limit - dirty
>   pos_ratio(dirty) =  ----------------
>                       limit - setpoint
> 
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh) something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace.

Thanks for your reasoning, which led to the more elegant

                            setpoint - dirty 3
   pos_ratio(dirty) := 1 + (----------------)
                            limit - setpoint
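
As a quick sanity check (a side calculation, not part of the original
mail), writing s for setpoint, l for limit and x for dirty, the cubic
still satisfies the two conditions derived above, and its slope
vanishes at the setpoint:

```latex
f(x) = 1 + \left(\frac{s - x}{l - s}\right)^{3}
\qquad
f(s) = 1 + 0^{3} = 1,
\qquad
f(l) = 1 + \left(\frac{s - l}{l - s}\right)^{3} = 1 + (-1)^{3} = 0

f'(x) = -\frac{3\,(s - x)^{2}}{(l - s)^{3}} \le 0 \quad (l > s),
\qquad
f'(s) = 0
```

So the curve is flat at the setpoint and steepens with distance from
it, whereas the linear form has the constant slope -1/(l - s).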

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 12:59               ` Wu Fengguang
  (?)
@ 2011-08-12 13:08                 ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-12 13:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 20:59 +0800, Wu Fengguang wrote:
> On Fri, Aug 12, 2011 at 08:54:17PM +0800, Peter Zijlstra wrote:
> > On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> > > 
> > >                s - x 3
> > >  f(x) :=  1 + (-----)
> > >                  d
> > > 
> > btw, if you want steeper slopes for rampup and brake you can add another
> > factor like:
> > 
> >                  s - x 3
> >   f(x) :=  1 + a(-----)
> >                    d
> >  
> > And solve the whole f(l)=0 thing again to determine d in l and a.
> > 
> > For 0 < a < 1 the slopes increase.
> 
> Yes, we can leave it as a future tuning option. For now I'm pretty
> satisfied with the current function's shape :)

Oh for sure, it just occurred to me when looking at your plots and I
thought I'd at least mention it. You know, something to poke at on a
rainy afternoon ;-)

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-12 13:04             ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-12 13:04 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> 
> Now all of the above would seem to suggest:
> 
>   dirty_ratelimit := ref_bw
> 
> However for that you use:
> 
>   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
>         dirty_ratelimit = max(ref_bw, pos_bw);
> 
>   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
>         dirty_ratelimit = min(ref_bw, pos_bw);
> 
> You have:
> 
>   pos_bw = dirty_ratelimit * pos_ratio
> 
> Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> why are you ignoring the shift in output vs input rate there? 

Could you elaborate on this primary feedback loop? It's the one part I
don't feel I actually understand well.



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 12:54             ` Peter Zijlstra
@ 2011-08-12 12:59               ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-12 12:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 08:54:17PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> > 
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                  d
> > 
> btw, if you want steeper slopes for rampup and brake you can add another
> factor like:
> 
>                  s - x 3
>   f(x) :=  1 + a(-----)
>                    d
>  
> And solve the whole f(l)=0 thing again to determine d in l and a.
> 
> For 0 < a < 1 the slopes increase.

Yes, we can leave it as a future tuning option. For now I'm pretty
satisfied with the current function's shape :)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-12 12:54             ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-12 12:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 00:56 +0200, Peter Zijlstra wrote:
> 
>                s - x 3
>  f(x) :=  1 + (-----)
>                  d
> 
btw, if you want steeper slopes for rampup and brake you can add another
factor like:

                 s - x 3
  f(x) :=  1 + a(-----)
                   d
 
And solve the whole f(l)=0 thing again to determine d in l and a.

For 0 < a < 1 the slopes increase.
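
Working this through (a side calculation, not from the original mail),
with s the setpoint and l the limit, solving f(l) = 0 for d and
substituting back gives:

```latex
f(l) = 1 + a\left(\frac{s - l}{d}\right)^{3} = 0
\;\Longrightarrow\;
d^{3} = a\,(l - s)^{3}
\;\Longrightarrow\;
d = a^{1/3}\,(l - s)

f(x) = 1 + a\left(\frac{s - x}{a^{1/3}\,(l - s)}\right)^{3}
     = 1 + \left(\frac{s - x}{l - s}\right)^{3}
```

Under just these two constraints, f(s) = 1 and f(l) = 0, the factor a
cancels: d shrinks for 0 < a < 1, but the curve itself appears
unchanged. Getting steeper slopes would then seem to need one more
degree of freedom, for instance a different exponent or a relaxed
f(l) = 0.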

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12 11:07                     ` Wu Fengguang
  (?)
@ 2011-08-12 12:17                       ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-12 12:17 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 19:07 +0800, Wu Fengguang wrote:
> Because pos_ratio was "unsigned long long"..

Ah! totally missed that ;-)

Yes looks good.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  9:47                 ` Peter Zijlstra
@ 2011-08-12 11:11                   ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-12 11:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 05:47:54PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 10:43 +0800, Wu Fengguang wrote:
> > >                s - x 3
> > >  f(x) :=  1 + (-----)
> > >                l - s
> > 
> 
> > Looks very neat, much simpler than the three curves solution!
> 
> Glad you like it, there is of course the small matter of real-world
> behaviour to consider, let's hope that works as well :-)

It magically meets all the criteria in my mind, not to mention it can
eliminate 2 extra patches. As for the tests, so far, so good :)

Your arithmetic is awesome!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  9:45                   ` Peter Zijlstra
@ 2011-08-12 11:07                     ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-12 11:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 05:45:33PM +0800, Peter Zijlstra wrote:
> On Fri, 2011-08-12 at 13:45 +0800, Wu Fengguang wrote:
> > Code is
> > 
> >         unsigned long freerun = (thresh + bg_thresh) / 2;
> > 
> >         setpoint = (limit + freerun) / 2;
> >         pos_ratio = abs(dirty - setpoint);
> >         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> >         do_div(pos_ratio, limit - setpoint + 1);
> 
> Why do you use do_div()? from the code those things are unsigned long,
> and you can divide that just fine.

Because pos_ratio was "unsigned long long"..

> Also, there's div64_s64 that can do signed divides for s64 types.
> That'll lose the extra conditionals you used for abs and putting the
> sign back.

Ah ok, good to know that :)

> >         x = pos_ratio;
> >         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
> >         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
> 
> So on 32bit with unsigned long that gets 32=2*(10+b) bits for x, that
> solves to 6, which isn't going to be enough I figure since
> (dirty-setpoint) !< 64.
> 
> So you really need to use u64/s64 types here, unsigned long just won't
> do, with u64 you have 64=2(10+b) 22 bits for x, which should fit.

Sure, here is the updated code:

        long long pos_ratio;            /* for scaling up/down the rate limit */
        long x;

        if (unlikely(dirty >= limit))
                return 0;

        /*
         * global setpoint
         *
         *                  setpoint - dirty 3
         * f(dirty) := 1 + (----------------)
         *                  limit - setpoint
         *
         * it's a 3rd order polynomial that is subject to
         *
         * (1) f(freerun)  = 2.0 => rampup base_rate reasonably fast
         * (2) f(setpoint) = 1.0 => the balance point
         * (3) f(limit)    = 0   => the hard limit
         * (4) df/dx      <= 0   => negative feedback control
         * (5) the closer to setpoint, the smaller |df/dx| (and the reverse),
         *     => fast response on large errors; small oscillation near setpoint
         */
        setpoint = (limit + freerun) / 2;
        /* widen before subtracting: setpoint and dirty are unsigned long */
        pos_ratio = ((long long)setpoint - (long long)dirty) <<
                                                RATELIMIT_CALC_SHIFT;
        pos_ratio = div_s64(pos_ratio, limit - setpoint + 1);
        x = pos_ratio;
        pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
        pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
        pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
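
For experimenting with the curve outside the kernel, the following is
a small user-space sketch of the same fixed-point computation. It is an
approximation under stated assumptions: pos_ratio_sketch() is a made-up
name, RATELIMIT_CALC_SHIFT is assumed to be 10, freerun and limit are
passed in directly (with freerun <= limit), plain 64-bit division
stands in for div_s64(), and division by the fixed-point unit replaces
the right shifts so that negative values behave portably (rounding
toward zero rather than down).

```c
#include <stdint.h>

#define RATELIMIT_CALC_SHIFT	10	/* assumed: 10 fractional bits */
#define FP_ONE			((int64_t)1 << RATELIMIT_CALC_SHIFT)

/* Returns f(dirty) in fixed point: FP_ONE means a ratio of 1.0. */
int64_t pos_ratio_sketch(unsigned long freerun, unsigned long limit,
			 unsigned long dirty)
{
	unsigned long setpoint = (limit + freerun) / 2;
	int64_t x, pos_ratio;

	if (dirty >= limit)
		return 0;		/* hard limit: f(limit) = 0 */

	/*
	 * Widen to signed 64 bit before subtracting, so that
	 * dirty > setpoint gives a negative x instead of wrapping.
	 */
	x = ((int64_t)setpoint - (int64_t)dirty) * FP_ONE;
	x /= (int64_t)(limit - setpoint) + 1;

	/* Cube x, rescaling after each multiplication. */
	pos_ratio = x * x / FP_ONE;
	pos_ratio = pos_ratio * x / FP_ONE;

	return pos_ratio + FP_ONE;	/* 1 + x^3 */
}
```

With freerun = 1000 and limit = 3046 (so setpoint = 2023), the sketch
returns 1024 (1.0) at dirty = 2023, 2045 (about 2.0) at dirty = 1000,
and 0 at dirty = 3046.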

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  2:43               ` Wu Fengguang
  (?)
@ 2011-08-12  9:47                 ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-12  9:47 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 10:43 +0800, Wu Fengguang wrote:
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                l - s
> 

> Looks very neat, much simpler than the three curves solution!

Glad you like it. There is of course the small matter of real-world
behaviour to consider; let's hope that works as well :-)

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  5:45                 ` Wu Fengguang
  (?)
@ 2011-08-12  9:45                   ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-12  9:45 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, 2011-08-12 at 13:45 +0800, Wu Fengguang wrote:
> Code is
> 
>         unsigned long freerun = (thresh + bg_thresh) / 2;
> 
>         setpoint = (limit + freerun) / 2;
>         pos_ratio = abs(dirty - setpoint);
>         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
>         do_div(pos_ratio, limit - setpoint + 1);

Why do you use do_div()? From the code, those things are unsigned long,
and you can divide those just fine.

Also, there's div64_s64 that can do signed divides for s64 types.
That'll lose the extra conditionals you used for abs and putting the
sign back.
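
To make the point about the signed divide concrete, here is a small userspace
sketch comparing the two ways of computing the scaled error term. Everything in
it is illustrative: the helper names scaled_error_abs and scaled_error_signed
and the SHIFT value of 10 are assumptions, and C's native signed division is
used where kernel code would call a helper such as div64_s64.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SHIFT 10                          /* assumed fixed-point shift */
#define ONE   ((int64_t)1 << SHIFT)

/* abs() first, divide the magnitude, then put the sign back by hand. */
static int64_t scaled_error_abs(int64_t dirty, int64_t setpoint, int64_t limit)
{
	int64_t r;

	assert(limit > setpoint);
	r = llabs(dirty - setpoint) * ONE;
	r /= limit - setpoint + 1;
	if (dirty > setpoint)
		r = -r;
	return r;
}

/* One signed divide; no abs() and no sign fix-up needed. */
static int64_t scaled_error_signed(int64_t dirty, int64_t setpoint, int64_t limit)
{
	assert(limit > setpoint);
	return (setpoint - dirty) * ONE / (limit - setpoint + 1);
}
```

Both return the same value because C integer division truncates toward zero,
so dividing a negative numerator gives the negated result of dividing its
magnitude.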

>         x = pos_ratio;
>         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
>         pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;

So on 32bit with unsigned long, that gets 32=2*(10+b) bits for x, which
solves to b=6. That isn't going to be enough, I figure, since
(dirty-setpoint) !< 64.

So you really need to use u64/s64 types here; unsigned long just won't
do. With u64 you have 64=2*(10+b), so b=22 bits for x, which should fit.
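
The bit budget used above, width = 2*(shift + b), can be checked mechanically.
The helper below is only a sketch of that arithmetic; the name headroom_bits is
made up for this illustration and does not correspond to anything in the patch.

```c
#include <assert.h>

/*
 * If x carries `shift` fractional bits plus b integer bits, then x * x
 * needs 2 * (shift + b) bits. Solve width = 2 * (shift + b) for b.
 */
static int headroom_bits(int width, int shift)
{
	assert(width > 0 && shift >= 0 && width >= 2 * shift);
	return width / 2 - shift;
}
```

For a shift of 10 this yields 6 bits with a 32-bit type and 22 bits with a
64-bit type, matching the figures in the mail.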

>         if (dirty > setpoint)
>                 pos_ratio = -pos_ratio;
>         pos_ratio += 1 << BANDWIDTH_CALC_SHIFT; 

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  2:43               ` Wu Fengguang
@ 2011-08-12  5:45                 ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-12  5:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

> > Making our final function look like:
> > 
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                l - s
> 
> Very intuitive reasoning, thanks!
> 
> I substituted real numbers to the function assuming a mem=2GB system.
> 
> with limit=thresh:
> 
>         gnuplot> set xrange [60000:80000]
>         gnuplot> plot 1 +  (70000.0 - x)**3/(80000-70000.0)**3

I'll use the above one, which is simpler and more elegant:

        f(freerun)  = 2.0
        f(setpoint) = 1.0
        f(limit)    = 0

Code is

        unsigned long freerun = (thresh + bg_thresh) / 2;

        setpoint = (limit + freerun) / 2;
        pos_ratio = abs(dirty - setpoint);
        pos_ratio <<= BANDWIDTH_CALC_SHIFT;
        do_div(pos_ratio, limit - setpoint + 1);
        x = pos_ratio;
        pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
        pos_ratio = pos_ratio * x >> BANDWIDTH_CALC_SHIFT;
        if (dirty > setpoint)
                pos_ratio = -pos_ratio;
        pos_ratio += 1 << BANDWIDTH_CALC_SHIFT;

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-12  2:43               ` Wu Fengguang
  (?)
@ 2011-08-12  3:18               ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-12  3:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: Type: text/plain, Size: 1306 bytes --]

Sorry, I forgot the 2 gnuplot figures; they are attached now.

> > Making our final function look like:
> > 
> >                s - x 3
> >  f(x) :=  1 + (-----)
> >                l - s
> 
> Very intuitive reasoning, thanks!
> 
> I substituted real numbers to the function assuming a mem=2GB system.
> 
> with limit=thresh:
> 
>         gnuplot> set xrange [60000:80000]
>         gnuplot> plot 1 +  (70000.0 - x)**3/(80000-70000.0)**3
> 
> with limit=thresh+thresh/DIRTY_SCOPE
> 
>         gnuplot> set xrange [60000:90000]
>         gnuplot> plot 1 +  (70000.0 - x)**3/(90000-70000.0)**3
> 
> Figures attached.  The latter produces reasonably flat slope and I'll
> give it a spin in the dd tests :)
>  
> > You can clamp it at [0,2] or so.
> 
> Looking at the figures, we may even do without the clamp because it's
> already inside the range [0, 2].
> 
> > The implementation wouldn't be too horrid either, something like:
> > 
> > unsigned long bdi_pos_ratio(..)
> > {
> > 	if (dirty > limit)
> > 		return 0;
> > 
> > 	if (dirty < 2*setpoint - limit)
> > 		return 2 * SCALE;
> > 
> > 	x = SCALE * (setpoint - dirty) / (limit - setpoint);
> > 	xx = (x * x) / SCALE;
> > 	xxx = (xx * x) / SCALE;
> > 
> > 	return xxx;
> > }
> 
> Looks very neat, much simpler than the three curves solution!
> 
> Thanks,
> Fengguang

[-- Attachment #2: 3rd-order-limit=thresh+halfscope.png --]
[-- Type: image/png, Size: 30247 bytes --]

[-- Attachment #3: 3rd-order-limit=thresh.png --]
[-- Type: image/png, Size: 28785 bytes --]

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11 22:56             ` Peter Zijlstra
@ 2011-08-12  2:43               ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-12  2:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Fri, Aug 12, 2011 at 06:56:06AM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> > So going by:
> > 
> >                                          write_bw
> >   ref_bw = dirty_ratelimit * pos_ratio * --------
> >                                          dirty_bw
> > 
> > pos_ratio seems to be the feedback on the deviation of the dirty pages
> > around its setpoint. So we adjust the reference bw (or rather ratelimit)
> > to take account of the shift in output vs input capacity as well as the
> > shift in dirty pages around its setpoint.
> > 
> > From that we derive the condition that: 
> > 
> >   pos_ratio(setpoint) := 1
> > 
> > Now in order to create a linear function we need one more condition. We
> > get one from the fact that once we hit the limit we should hard throttle
> > our writers. We get that by setting the ratelimit to 0, because, after
> > all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> > 
> >   pos_ratio(limit) := 0
> > 
> > Using these two conditions we can solve the equations and get your:
> > 
> >                         limit - dirty
> >   pos_ratio(dirty) =  ----------------
> >                       limit - setpoint
> > 
> > Now, for some reason you chose not to use limit, but something like
> > min(limit, 4*thresh) something to do with the slope affecting the rate
> > of adjustment. This wants a comment someplace. 
> 
> Ok, so I think that pos_ratio(limit) := 0, is a stronger condition than
> your negative slope (df/dx < 0), simply because it implies your
> condition and because it expresses our hard stop at limit.

Right. That's a good point.

> Also, while I know this is totally over the top, but..
> 
> I saw you added a ramp and brake area in future patches, so have you
> considered using a third order polynomial instead?

No I have not ;)

The 3 lines/curves should be a bit more flexible/configurable than the
single 3rd order polynomial.  However, the 3rd order polynomial is surely
much simpler and more consistent, since it removes the explicit
rampup/brake areas and curves.

> The simple:
> 
>  f(x) = -x^3 
> 
> has the 'right' shape, all we need is move it so that:
> 
>  f(s) = 1
> 
> and stretch it to put the single root at our limit. You'd get something
> like:
> 
>                s - x 3
>  f(x) :=  1 + (-----)
>                  d
> 
> Which, as required, is 1 at our setpoint and the factor d stretches the
> middle bit. Which has a single (real) root at: 
> 
>   x = s + d, 
> 
> by setting that to our limit, we get:
> 
>   d = l - s
> 
> Making our final function look like:
> 
>                s - x 3
>  f(x) :=  1 + (-----)
>                l - s

Very intuitive reasoning, thanks!

I substituted real numbers into the function, assuming a mem=2GB system.

with limit=thresh:

        gnuplot> set xrange [60000:80000]
        gnuplot> plot 1 +  (70000.0 - x)**3/(80000-70000.0)**3

with limit=thresh+thresh/DIRTY_SCOPE

        gnuplot> set xrange [60000:90000]
        gnuplot> plot 1 +  (70000.0 - x)**3/(90000-70000.0)**3

Figures attached.  The latter produces a reasonably flat slope and I'll
give it a spin in the dd tests :)
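
Without the attached figures, the "flatter slope" observation can still be
reproduced numerically. The sketch below evaluates the cubic and its derivative
for the two gnuplot parameter sets; the function names cubic_value and
cubic_slope are hypothetical, and double precision is used purely for
illustration (kernel code would use fixed point).

```c
#include <assert.h>

/* f(x) = 1 + ((s - x) / (l - s))^3 */
static double cubic_value(double x, double s, double l)
{
	double t;

	assert(l > s);
	t = (s - x) / (l - s);
	return 1.0 + t * t * t;
}

/* df/dx = -3 * (s - x)^2 / (l - s)^3, never positive for l > s */
static double cubic_slope(double x, double s, double l)
{
	double d;

	assert(l > s);
	d = l - s;
	return -3.0 * (s - x) * (s - x) / (d * d * d);
}
```

At x=75000 with setpoint 70000, the value is 0.875 for limit=80000 and
0.984375 for limit=90000, and the slope with the larger limit is one eighth of
the other, which is the flattening seen in the second plot.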
 
> You can clamp it at [0,2] or so.

Looking at the figures, we may even do without the clamp because it's
already inside the range [0, 2].
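
One caveat worth spelling out: the cubic stays inside [0, 2] only for dirty
values between 2*setpoint - limit and limit. Outside that window it leaves the
range, so some form of clamp (or the early returns in the sketch quoted below)
is still needed there. A minimal illustration, with the hypothetical names
clamp_double and pos_ratio_clamped and floating point used only for clarity:

```c
#include <assert.h>

static double clamp_double(double v, double lo, double hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/* f(x) = 1 + ((s - x) / (l - s))^3, clamped to [0, 2] */
static double pos_ratio_clamped(double x, double s, double l)
{
	double t;

	assert(l > s);
	t = (s - x) / (l - s);
	return clamp_double(1.0 + t * t * t, 0.0, 2.0);
}
```

For example, with setpoint 70000 and limit 80000, x=50000 would give 9
unclamped and x=85000 would give -2.375; the clamp maps these to 2 and 0.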

> The implementation wouldn't be too horrid either, something like:
> 
> unsigned long bdi_pos_ratio(..)
> {
> 	if (dirty > limit)
> 		return 0;
> 
> 	if (dirty < 2*setpoint - limit)
> 		return 2 * SCALE;
> 
> 	x = SCALE * (setpoint - dirty) / (limit - setpoint);
> 	xx = (x * x) / SCALE;
> 	xxx = (xx * x) / SCALE;
> 
> 	return xxx;
> }

Looks very neat, much simpler than the three curves solution!

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-11 22:56             ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-11 22:56 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 19:20 +0200, Peter Zijlstra wrote:
> So going by:
> 
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw
> 
> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint. So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.
> 
> From that we derive the condition that: 
> 
>   pos_ratio(setpoint) := 1
> 
> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> 
>   pos_ratio(limit) := 0
> 
> Using these two conditions we can solve the equations and get your:
> 
>                         limit - dirty
>   pos_ratio(dirty) =  ----------------
>                       limit - setpoint
> 
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh) something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace. 
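
For reference, the two quoted formulas (the linear pos_ratio and the ref_bw
update) can be written out as a small userspace sketch. The function names
pos_ratio_linear and ref_bw_estimate are made up here, and doubles are used
only to keep the arithmetic readable; this is not the patch's fixed-point code.

```c
#include <assert.h>

/* Linear feedback: 1 at the setpoint, 0 at the limit. */
static double pos_ratio_linear(double dirty, double setpoint, double limit)
{
	assert(limit > setpoint);
	return (limit - dirty) / (limit - setpoint);
}

/* ref_bw = dirty_ratelimit * pos_ratio * write_bw / dirty_bw */
static double ref_bw_estimate(double dirty_ratelimit, double pos_ratio,
			      double write_bw, double dirty_bw)
{
	assert(dirty_bw > 0.0);
	return dirty_ratelimit * pos_ratio * write_bw / dirty_bw;
}
```

When dirty sits at the setpoint, pos_ratio is 1 and the estimate reduces to
dirty_ratelimit scaled by write_bw / dirty_bw alone.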

Ok, so I think that pos_ratio(limit) := 0, is a stronger condition than
your negative slope (df/dx < 0), simply because it implies your
condition and because it expresses our hard stop at limit.

Also, I know this is totally over the top, but..

I saw you added a ramp and brake area in future patches, so have you
considered using a third order polynomial instead?

The simple:

 f(x) = -x^3 

has the 'right' shape, all we need is move it so that:

 f(s) = 1

and stretch it to put the single root at our limit. You'd get something
like:

               s - x 3
 f(x) :=  1 + (-----)
                 d

This, as required, is 1 at our setpoint, and the factor d stretches the
middle bit. It has a single (real) root at:

  x = s + d, 

by setting that to our limit, we get:

  d = l - s

Making our final function look like:

               s - x 3
 f(x) :=  1 + (-----)
               l - s
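
The three anchor points of this function are easy to verify numerically. The
following is a floating-point sketch only; the name pos_ratio_f and the sample
values (s=70000, l=80000) are chosen for illustration.

```c
#include <assert.h>

/* f(x) = 1 + ((s - x) / (l - s))^3, with s the setpoint and l the limit */
static double pos_ratio_f(double x, double s, double l)
{
	double t;

	assert(l > s);
	t = (s - x) / (l - s);
	return 1.0 + t * t * t;
}
```

It evaluates to 1 at x = s, to 0 at x = l (the single real root), and to 2 at
the mirrored point x = 2*s - l.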

You can clamp it at [0,2] or so. The implementation wouldn't be too
horrid either, something like:

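unsigned long bdi_pos_ratio(..)
{
	long x, xx, xxx;	/* signed: x < 0 once dirty > setpoint */

	if (dirty > limit)
		return 0;

	if (dirty < 2*setpoint - limit)
		return 2 * SCALE;

	x = SCALE * ((long)setpoint - (long)dirty) / (long)(limit - setpoint);
	xx = (x * x) / SCALE;
	xxx = (xx * x) / SCALE;

	return SCALE + xxx;	/* the "1 +" term of f(x) */
}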


^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-11  2:29                 ` Wu Fengguang
@ 2011-08-11 11:14                   ` Jan Kara
  -1 siblings, 0 replies; 169+ messages in thread
From: Jan Kara @ 2011-08-11 11:14 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Peter Zijlstra, linux-fsdevel, Andrew Morton,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Thu 11-08-11 10:29:52, Wu Fengguang wrote:
> On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> > On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > > >                     origin - dirty
> > > > >         pos_ratio = --------------
> > > > >                     origin - goal 
> > > > 
> > > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > > pos_ratio == 1.0:
> > > > 
> > > > OK, so basically you want a linear function for which:
> > > > 
> > > > f(goal) = 1 and has a root somewhere > goal.
> > > > 
> > > > (that one line is much more informative than all your graphs put
> > > > together, one can start from there and derive your function)
> > > > 
> > > > That does indeed get you the above function, now what does it mean? 
> > > 
> > > So going by:
> > > 
> > >                                          write_bw
> > >   ref_bw = dirty_ratelimit * pos_ratio * --------
> > >                                          dirty_bw
> > 
> >   Actually, thinking about these formulas, why do we even bother with
> > computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> > Couldn't we just have a feedback loop (probably similar to the one
> > computing pos_ratio) which will maintain single value - ratelimit? When we
> > are getting close to dirty limit, we will scale ratelimit down, when we
> > will be getting significantly below dirty limit, we will scale the
> > ratelimit up.  Because looking at the formulas it seems to me that the net
> > effect is the same - pos_ratio basically overrules everything... 
> 
> Good question. That is actually one of the early approaches I tried.
> It worked to some extent; however, the resulting ratelimit was not only
> slow to respond but also oscillated all the time.
  Yes, I think I vaguely remember that.

> This is due to the following imperfections:
> 
> 1) pos_ratio at best only provides a "direction" for adjusting the
>    ratelimit. There are only vague clues that if pos_ratio is small,
>    the errors in ratelimit should be small.
> 
> 2) Due to time lag, the assumptions in (1) about "direction" and
>    "error size" can be wrong. The ratelimit may already be
>    over-adjusted by the time the dirty pages approach the
>    setpoint. The larger the memory, the longer the time lag, and the
>    easier it is to overshoot and oscillate.
> 
> 3) Dirty pages are constantly fluctuating around the setpoint,
>    and so is pos_ratio.
> 
> With (1) and (2), it's a control system very susceptible to disturbances.
> With (3) we get constant disturbances. I had a very hard time and
> played dirty tricks (which you may never want to know ;-) trying to
> trade off response time against stability..
  Yes, I can see that 2) especially is a problem. But I don't understand why
your current formula would be that much different. As Peter decoded from
your code, your current formula is:
                                        write_bw
 ref_bw = dirty_ratelimit * pos_ratio * --------
                                        dirty_bw

while previously it was essentially:
 ref_bw = dirty_ratelimit * pos_ratio

So what is so magical about computing write_bw and dirty_bw separately? Is
it because previously you did not use the derivative of the distance from
the goal for updating pos_ratio? Because in your current formula
write_bw/dirty_bw is a derivative of position...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-10 22:34               ` Jan Kara
@ 2011-08-11  2:29                 ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-11  2:29 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Thu, Aug 11, 2011 at 06:34:27AM +0800, Jan Kara wrote:
> On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> > On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > > >                     origin - dirty
> > > >         pos_ratio = --------------
> > > >                     origin - goal 
> > > 
> > > > which comes from the below [*] control line, so that when (dirty == goal),
> > > > pos_ratio == 1.0:
> > > 
> > > OK, so basically you want a linear function for which:
> > > 
> > > f(goal) = 1 and has a root somewhere > goal.
> > > 
> > > (that one line is much more informative than all your graphs put
> > > together, one can start from there and derive your function)
> > > 
> > > That does indeed get you the above function, now what does it mean? 
> > 
> > So going by:
> > 
> >                                          write_bw
> >   ref_bw = dirty_ratelimit * pos_ratio * --------
> >                                          dirty_bw
> 
>   Actually, thinking about these formulas, why do we even bother with
> computing all these factors like write_bw, dirty_bw, pos_ratio, ...
> Couldn't we just have a feedback loop (probably similar to the one
> computing pos_ratio) which will maintain single value - ratelimit? When we
> are getting close to dirty limit, we will scale ratelimit down, when we
> will be getting significantly below dirty limit, we will scale the
> ratelimit up.  Because looking at the formulas it seems to me that the net
> effect is the same - pos_ratio basically overrules everything... 

Good question. That is actually one of the early approaches I tried.
It worked to some extent; however, the resulting ratelimit was not only
slow to respond but also oscillated all the time.

This is due to the following imperfections:

1) pos_ratio at best only provides a "direction" for adjusting the
   ratelimit. There are only vague clues that if pos_ratio is small,
   the errors in ratelimit should be small.

2) Due to time lag, the assumptions in (1) about "direction" and
   "error size" can be wrong. The ratelimit may already be
   over-adjusted by the time the dirty pages approach the
   setpoint. The larger the memory, the longer the time lag, and the
   easier it is to overshoot and oscillate.

3) Dirty pages are constantly fluctuating around the setpoint,
   and so is pos_ratio.

With (1) and (2), it's a control system very susceptible to disturbances.
With (3) we get constant disturbances. I had a very hard time and
played dirty tricks (which you may never want to know ;-) trying to
trade off response time against stability..

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09 17:20             ` Peter Zijlstra
@ 2011-08-10 22:34               ` Jan Kara
  -1 siblings, 0 replies; 169+ messages in thread
From: Jan Kara @ 2011-08-10 22:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Wu Fengguang, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Vivek Goyal, Andrea Righi, linux-mm, LKML

On Tue 09-08-11 19:20:27, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> > >                     origin - dirty
> > >         pos_ratio = --------------
> > >                     origin - goal 
> > 
> > > which comes from the below [*] control line, so that when (dirty == goal),
> > > pos_ratio == 1.0:
> > 
> > OK, so basically you want a linear function for which:
> > 
> > f(goal) = 1 and has a root somewhere > goal.
> > 
> > (that one line is much more informative than all your graphs put
> > together, one can start from there and derive your function)
> > 
> > That does indeed get you the above function, now what does it mean? 
> 
> So going by:
> 
>                                          write_bw
>   ref_bw = dirty_ratelimit * pos_ratio * --------
>                                          dirty_bw

  Actually, thinking about these formulas, why do we even bother with
computing all these factors like write_bw, dirty_bw, pos_ratio, ...
Couldn't we just have a feedback loop (probably similar to the one
computing pos_ratio) which maintains a single value - ratelimit? When we
are getting close to the dirty limit, we scale ratelimit down; when we
get significantly below the dirty limit, we scale ratelimit up.
Because looking at the formulas it seems to me that the net
effect is the same - pos_ratio basically overrules everything...

> pos_ratio seems to be the feedback on the deviation of the dirty pages
> around its setpoint. So we adjust the reference bw (or rather ratelimit)
> to take account of the shift in output vs input capacity as well as the
> shift in dirty pages around its setpoint.
> 
> From that we derive the condition that: 
> 
>   pos_ratio(setpoint) := 1
> 
> Now in order to create a linear function we need one more condition. We
> get one from the fact that once we hit the limit we should hard throttle
> our writers. We get that by setting the ratelimit to 0, because, after
> all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:
> 
>   pos_ratio(limit) := 0
> 
> Using these two conditions we can solve the equations and get your:
> 
>                         limit - dirty
>   pos_ratio(dirty) =  ----------------
>                       limit - setpoint
> 
> Now, for some reason you chose not to use limit, but something like
> min(limit, 4*thresh) something to do with the slope affecting the rate
> of adjustment. This wants a comment someplace.
> 
> 
> Now all of the above would seem to suggest:
> 
>   dirty_ratelimit := ref_bw
> 
> However for that you use:
> 
>   if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
> 	dirty_ratelimit = max(ref_bw, pos_bw);
> 
>   if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
> 	dirty_ratelimit = min(ref_bw, pos_bw);
> 
> You have:
> 
>   pos_bw = dirty_ratelimit * pos_ratio
> 
> Which is ref_bw without the write_bw/dirty_bw factor, this confuses me..
> why are you ignoring the shift in output vs input rate there?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
@ 2011-08-10 21:40             ` Vivek Goyal
  -1 siblings, 0 replies; 169+ messages in thread
From: Vivek Goyal @ 2011-08-10 21:40 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-fsdevel, Andrew Morton, Jan Kara,
	Christoph Hellwig, Dave Chinner, Greg Thelen, Minchan Kim,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 07:05:35AM +0800, Wu Fengguang wrote:
> On Mon, Aug 08, 2011 at 10:41:41PM +0800, Peter Zijlstra wrote:
> > On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > > @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
> > >         goal = thresh - thresh / DIRTY_SCOPE;
> > >         origin = 4 * thresh;
> > >  
> > > -       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > > -               origin = limit;                 /* auxiliary control line */
> > > -               goal = (goal + origin) / 2;
> > > -               pos_ratio >>= 1;
> > > -       }
> > >         pos_ratio = origin - dirty;
> > >         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> > >         do_div(pos_ratio, origin - goal + 1); 
> 
> FYI I've updated the fix to the below one, so that @limit will be used
> as the origin in the rare case of (4*thresh < dirty).
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-09 06:34:25.000000000 +0800
> @@ -536,13 +536,8 @@ static unsigned long bdi_position_ratio(
>  	 * global setpoint
>  	 */
>  	goal = thresh - thresh / DIRTY_SCOPE;
> -	origin = 4 * thresh;
> +	origin = max(4 * thresh, limit);

Hi Fengguang,

OK, I am just trying to understand this pos_ratio a little better.

You have the following basic formula:

                     origin - dirty
         pos_ratio = --------------
                     origin - goal

The terminology is very confusing; the following is my understanding.

- setpoint == goal

  setpoint is the point where we would like our number of dirty pages to
  be, and at this point pos_ratio = 1. For global dirty this number seems
  to be (thresh - thresh / DIRTY_SCOPE).

- thresh

  The dirty page threshold calculated from dirty_ratio (a certain
  percentage of total memory).

- origin (seems to be equivalent to limit)

  This seems to be the reference point/limit we don't want to cross, and
  the distance from this limit basically decides the pos_ratio. The closer
  we are to the limit, the lower the pos_ratio; the further away we are,
  the higher the pos_ratio.

So threshold is just a number which helps us determine goal and limit.

goal = thresh - thresh / DIRTY_SCOPE
limit = 4*thresh

So goal is where we want to be, and we start throttling the task more as
we move away from the goal and approach the limit. We keep the limit high
enough so that (origin - dirty) does not become negative.

So we do expect to cross "thresh"? Otherwise thresh itself could have
served as the limit.

If my understanding is right, can we get rid of the terms "setpoint" and
"origin"? Would it be easier to understand things if we just talked
in terms of "goal" and "limit" and how these are derived from "thresh"?

	thresh == soft limit
	limit == 4*thresh (hard limit)
	goal = thresh - thresh / DIRTY_SCOPE (where we want system to
						be in steady state).
                     limit - dirty
         pos_ratio = --------------
                     limit - goal
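
The goal/limit naming above can be sketched in userspace C as follows.
This is only an illustration, not the patch itself: the function names
are hypothetical, DIRTY_SCOPE = 8 and BANDWIDTH_CALC_SHIFT = 10 are
assumed values that this thread does not state, and the division is exact
here rather than by (origin - goal + 1) as in the quoted hunk.

```c
#include <stdint.h>

#define DIRTY_SCOPE 8			/* assumed value, not given in this thread */
#define BANDWIDTH_CALC_SHIFT 10		/* assumed fixed-point precision */

/* goal = thresh - thresh / DIRTY_SCOPE: where we want to sit in steady state */
static uint64_t dirty_goal(uint64_t thresh)
{
	return thresh - thresh / DIRTY_SCOPE;
}

/*
 * pos_ratio = (limit - dirty) / (limit - goal) with limit = 4 * thresh.
 * The result is fixed point: (1 << BANDWIDTH_CALC_SHIFT) means 1.0.
 * Returns 0 once dirty reaches the limit, so the numerator never goes
 * negative. Precondition (assumed): thresh > 0, so that limit > goal.
 */
static uint64_t linear_pos_ratio(uint64_t dirty, uint64_t thresh)
{
	uint64_t goal = dirty_goal(thresh);
	uint64_t limit = 4 * thresh;

	if (dirty >= limit)
		return 0;
	return ((limit - dirty) << BANDWIDTH_CALC_SHIFT) / (limit - goal);
}
```

For thresh = 800 this gives goal = 700 and limit = 3200; pos_ratio is 1.0
at dirty = 700 and drops linearly to 0 at dirty = 3200.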

Thanks
Vivek

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-09  9:31             ` Peter Zijlstra
@ 2011-08-10 12:28               ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-10 12:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, Aug 09, 2011 at 05:31:44PM +0800, Peter Zijlstra wrote:
> On Tue, 2011-08-09 at 06:47 +0800, Wu Fengguang wrote:
> > origin is where the control line crosses the X axis (in both the
> > global/bdi setpoint cases). 
> 
> Ah, that's normally called zero, root or x-intercept:
> 
> http://en.wikipedia.org/wiki/X-intercept

Yes indeed! I'll change the name to x_intercept.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-09 17:20             ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-09 17:20 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 12:32 +0200, Peter Zijlstra wrote:
> >                     origin - dirty
> >         pos_ratio = --------------
> >                     origin - goal 
> 
> > which comes from the below [*] control line, so that when (dirty == goal),
> > pos_ratio == 1.0:
> 
> OK, so basically you want a linear function for which:
> 
> f(goal) = 1 and has a root somewhere > goal.
> 
> (that one line is much more informative than all your graphs put
> together, one can start from there and derive your function)
> 
> That does indeed get you the above function, now what does it mean? 

So going by:

                                         write_bw
  ref_bw = dirty_ratelimit * pos_ratio * --------
                                         dirty_bw

pos_ratio seems to be the feedback on the deviation of the dirty pages
around its setpoint. So we adjust the reference bw (or rather ratelimit)
to take account of the shift in output vs input capacity as well as the
shift in dirty pages around its setpoint.

From that we derive the condition that:

  pos_ratio(setpoint) := 1

Now in order to create a linear function we need one more condition. We
get one from the fact that once we hit the limit we should hard throttle
our writers. We get that by setting the ratelimit to 0, because, after
all, pause = nr_dirtied / ratelimit would yield inf. in that case. Thus:

  pos_ratio(limit) := 0

Using these two conditions we can solve the equations and get your:

                        limit - dirty
  pos_ratio(dirty) =  ----------------
                      limit - setpoint
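
A minimal sketch of that line in C, assuming a fixed-point scale like the
patch's BANDWIDTH_CALC_SHIFT. The names pos_ratio_linear and POS_RATIO_SHIFT
are made up for illustration and are not in the patch; setpoint < limit is
assumed:

```c
#include <stdint.h>

#define POS_RATIO_SHIFT 10	/* hypothetical; mirrors BANDWIDTH_CALC_SHIFT */

/*
 * Linear control line through (setpoint, 1.0) and (limit, 0).
 * The result is fixed-point: 1.0 == 1 << POS_RATIO_SHIFT.
 * Assumes setpoint < limit.
 */
static uint64_t pos_ratio_linear(uint64_t dirty, uint64_t setpoint,
				 uint64_t limit)
{
	if (dirty >= limit)
		return 0;	/* at or past the limit: hard throttle */

	return ((limit - dirty) << POS_RATIO_SHIFT) / (limit - setpoint);
}
```

Below the setpoint the result is above 1.0, so the rate limit is scaled up;
above it the result falls linearly to 0 at the limit.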

Now, for some reason you chose not to use limit but something like
min(limit, 4*thresh), which seems to have something to do with the slope
affecting the rate of adjustment. This wants a comment someplace.

Now all of the above would seem to suggest:

  dirty_ratelimit := ref_bw

However for that you use:

  if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
	dirty_ratelimit = max(ref_bw, pos_bw);

  if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
	dirty_ratelimit = min(ref_bw, pos_bw);

You have:

  pos_bw = dirty_ratelimit * pos_ratio

Which is ref_bw without the write_bw/dirty_bw factor. This confuses me:
why are you ignoring the shift in output vs input rate there?
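
For reference, the quoted update rule written out as a standalone C function.
This is only a restatement of the two if-statements above under a
hypothetical name (update_dirty_ratelimit); it does not answer the question
of why pos_bw omits the write_bw/dirty_bw factor:

```c
/*
 * Move dirty_ratelimit only when pos_bw and ref_bw both lie on the same
 * side of it, and then only as far as the nearer of the two.
 * Otherwise leave it unchanged.
 */
static unsigned long update_dirty_ratelimit(unsigned long dirty_ratelimit,
					    unsigned long pos_bw,
					    unsigned long ref_bw)
{
	if (pos_bw < dirty_ratelimit && ref_bw < dirty_ratelimit)
		return pos_bw > ref_bw ? pos_bw : ref_bw;	/* max() */

	if (pos_bw > dirty_ratelimit && ref_bw > dirty_ratelimit)
		return pos_bw < ref_bw ? pos_bw : ref_bw;	/* min() */

	return dirty_ratelimit;
}
```

So the ratelimit steps to the more conservative of the two targets, and stays
put whenever pos_bw and ref_bw disagree about the direction.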

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 23:05           ` Wu Fengguang
  (?)
@ 2011-08-09 10:32             ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-09 10:32 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 07:05 +0800, Wu Fengguang wrote:
> This is the more meaningful view :)
> 
>                     origin - dirty
>         pos_ratio = --------------
>                     origin - goal 

> which comes from the below [*] control line, so that when (dirty == goal),
> pos_ratio == 1.0:

OK, so basically you want a linear function for which:

f(goal) = 1 and has a root somewhere > goal.

(that one line is much more informative than all your graphs put
together, one can start from there and derive your function)
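
Spelled out, with origin standing for the root: take a generic line
f(x) = a x + b and impose the two conditions.

```latex
\begin{align*}
f(x) &= a x + b, \qquad f(\mathrm{goal}) = 1, \qquad f(\mathrm{origin}) = 0 \\
a &= -\frac{1}{\mathrm{origin} - \mathrm{goal}}, \qquad
b = \frac{\mathrm{origin}}{\mathrm{origin} - \mathrm{goal}} \\
f(\mathrm{dirty}) &= \frac{\mathrm{origin} - \mathrm{dirty}}{\mathrm{origin} - \mathrm{goal}}
\end{align*}
```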

That does indeed get you the above function, now what does it mean?

> + *  When the number of dirty pages go higher/lower than the setpoint, the dirty
> + *  position ratio (and hence dirty rate limit) will be decreased/increased to
> + *  bring the dirty pages back to the setpoint.

(you seem inconsistent with your terminology, I think goal and setpoint
are interchanged? I looked up set point and it's a term from control
system theory, so I'll chalk that up to my own ignorance..)

Ok, so higher dirty -> lower position ratio -> lower dirty rate (and
the inverse), now what does that do...

/me goes read other patches in search of more clues.. I'm starting to
dislike graphs.. why not simply state where those things come from,
that's much easier.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 22:47           ` Wu Fengguang
  (?)
@ 2011-08-09  9:31             ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-09  9:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Tue, 2011-08-09 at 06:47 +0800, Wu Fengguang wrote:
> origin is where the control line crosses the X axis (in both the
> global/bdi setpoint cases). 

Ah, that's normally called zero, root or x-intercept:

http://en.wikipedia.org/wiki/X-intercept

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-09  2:08     ` Vivek Goyal
  -1 siblings, 0 replies; 169+ messages in thread
From: Vivek Goyal @ 2011-08-09  2:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:49PM +0800, Wu Fengguang wrote:
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> For simplicity, only the global/bdi setpoint control lines are
> implemented here, so the [*] curve is straighter than the ideal one
> shown in the above figure.
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulting task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 

IMHO, "position_ratio" is not necessarily very intuitive. Can there be
a better name? Based on your slides, it is a scaling factor applied to the
task rate limit depending on how well we are doing in terms of meeting
our goal of dirty limit. Would "dirty_rate_scale_factor" or something like
that make sense and be a little more intuitive?
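
To make the "scaling factor" reading concrete, a rough sketch of how I
understand the value would be applied. The helper name scale_task_ratelimit
is hypothetical, and the multiply-then-shift form is my assumption based on
the "pos_ratio = 1 << BANDWIDTH_CALC_SHIFT" line in the pseudo code below:

```c
#include <stdint.h>

#define BANDWIDTH_CALC_SHIFT 10	/* same value as in the quoted patch */

/*
 * Scale the base rate limit by a fixed-point factor in which
 * 1.0 == 1 << BANDWIDTH_CALC_SHIFT (assumed usage, not taken from the patch).
 */
static unsigned long scale_task_ratelimit(unsigned long dirty_ratelimit,
					  uint64_t pos_ratio)
{
	return (unsigned long)(((uint64_t)dirty_ratelimit * pos_ratio)
			       >> BANDWIDTH_CALC_SHIFT);
}
```

A factor of 1024 leaves the limit unchanged, 512 halves it, and 0 stops the
task from dirtying altogether.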

Thanks
Vivek
 

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |  143 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 143 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-06 10:31:32.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-06 11:17:07.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define BANDWIDTH_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -495,6 +497,147 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + *  When the number of dirty pages go higher/lower than the setpoint, the dirty
> + *  position ratio (and hence dirty rate limit) will be decreased/increased to
> + *  bring the dirty pages back to the setpoint.
> + *
> + *                              setpoint
> + *                                 v
> + * |-------------------------------*-------------------------------|-----------|
> + * ^                               ^                               ^           ^
> + * (thresh + background_thresh)/2  thresh - thresh/DIRTY_SCOPE     thresh  limit
> + *
> + *                          bdi setpoint
> + *                                 v
> + * |-------------------------------*-------------------------------------------|
> + * ^                               ^                                           ^
> + * 0                               bdi_thresh - bdi_thresh/DIRTY_SCOPE     limit
> + *
> + * (o) pseudo code
> + *
> + *     pos_ratio = 1 << BANDWIDTH_CALC_SHIFT
> + *
> + *     if (dirty < thresh) scale up   pos_ratio
> + *     if (dirty > thresh) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_thresh) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_thresh) scale down pos_ratio
> + *
> + * (o) global/bdi control lines
> + *
> + * Based on the number of dirty pages (the X), pos_ratio (the Y) is scaled by
> + * several control lines in turn.
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * If any control line drops below Y=0 before reaching @limit, an auxiliary
> + * line will be setup to connect them. The below figure illustrates the main
> + * bdi control line with an auxiliary line extending it to @limit.
> + *
> + * This allows smoothly throttling bdi_dirty down to normal if it starts high
> + * in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to 5 times higher than bdi setpoint.
> + * - the bdi dirty thresh goes down quickly due to change of JBOD workload
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, bw scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, bw scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0                 bdi setpoint                 bdi origin            limit
> + *
> + * The bdi control line: if (origin < limit), an auxiliary control line (*)
> + * will be setup to extend the main control line (o) to @limit.
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long origin;
> +	unsigned long goal;
> +	unsigned long long span;
> +	unsigned long long pos_ratio;	/* for scaling up/down the rate limit */
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 */
> +	goal = thresh - thresh / DIRTY_SCOPE;
> +	origin = 4 * thresh;
> +
> +	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> +		origin = limit;			/* auxiliary control line */
> +		goal = (goal + origin) / 2;
> +		pos_ratio >>= 1;
> +	}
> +	pos_ratio = origin - dirty;
> +	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> +	do_div(pos_ratio, origin - goal + 1);
> +
> +	/*
> +	 * bdi setpoint
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
> +	/*
> +	 * Use span=(4*bw) in single disk case and transit to bdi_thresh in
> +	 * JBOD case.  For JBOD, bdi_thresh could fluctuate up to its own size.
> +	 * Otherwise the bdi write bandwidth is good for limiting the floating
> +	 * area, which makes the bdi control line a good backup when the global
> +	 * control line is too flat/weak in large memory systems.
> +	 */
> +	span = (u64) bdi_thresh * (thresh - bdi_thresh) +
> +		(4 * bdi->avg_write_bandwidth) * bdi_thresh;
> +	do_div(span, thresh + 1);
> +	origin = goal + 2 * span;
> +
> +	if (unlikely(bdi_dirty > goal + span)) {
> +		if (bdi_dirty > limit)
> +			return 0;
> +		if (origin < limit) {
> +			origin = limit;		/* auxiliary control line */
> +			goal += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= origin - bdi_dirty;
> +	do_div(pos_ratio, origin - goal + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> 

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
@ 2011-08-09  2:08     ` Vivek Goyal
  0 siblings, 0 replies; 169+ messages in thread
From: Vivek Goyal @ 2011-08-09  2:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Andrea Righi, linux-mm,
	LKML

On Sat, Aug 06, 2011 at 04:44:49PM +0800, Wu Fengguang wrote:
> Old scheme is,
>                                           |
>                            free run area  |  throttle area
>   ----------------------------------------+---------------------------->
>                                     thresh^                  dirty pages
> 
> New scheme is,
> 
>   ^ task rate limit
>   |
>   |            *
>   |             *
>   |              *
>   |[free run]      *      [smooth throttled]
>   |                  *
>   |                     *
>   |                         *
>   ..bdi->dirty_ratelimit..........*
>   |                               .     *
>   |                               .          *
>   |                               .              *
>   |                               .                 *
>   |                               .                    *
>   +-------------------------------.-----------------------*------------>
>                           setpoint^                  limit^  dirty pages
> 
> For simplicity, only the global/bdi setpoint control lines are
> implemented here, so the [*] curve is more straight than the ideal one
> showed in the above figure.
> 
> bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
> that the resulted task rate limit can drive the dirty pages back to the
> global/bdi setpoints.
> 

IMHO, "position_ratio" is not necessarily very intutive. Can there be
a better name? Based on your slides, it is scaling factor applied to
task rate limit depending on how well we are doing in terms of meeting
our goal of dirty limit. Will "dirty_rate_scale_factor" or something like
that make sense and be little more intutive? 

Thanks
Vivek
 

> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  mm/page-writeback.c |  143 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 143 insertions(+)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-08-06 10:31:32.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-08-06 11:17:07.000000000 +0800
> @@ -46,6 +46,8 @@
>   */
>  #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
>  
> +#define BANDWIDTH_CALC_SHIFT	10
> +
>  /*
>   * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
>   * will look to see if it needs to force writeback or throttling.
> @@ -495,6 +497,147 @@ unsigned long bdi_dirty_limit(struct bac
>  	return bdi_dirty;
>  }
>  
> +/*
> + * Dirty position control.
> + *
> + * (o) global/bdi setpoints
> + *
> + *  When the number of dirty pages go higher/lower than the setpoint, the dirty
> + *  position ratio (and hence dirty rate limit) will be decreased/increased to
> + *  bring the dirty pages back to the setpoint.
> + *
> + *                              setpoint
> + *                                 v
> + * |-------------------------------*-------------------------------|-----------|
> + * ^                               ^                               ^           ^
> + * (thresh + background_thresh)/2  thresh - thresh/DIRTY_SCOPE     thresh  limit
> + *
> + *                          bdi setpoint
> + *                                 v
> + * |-------------------------------*-------------------------------------------|
> + * ^                               ^                                           ^
> + * 0                               bdi_thresh - bdi_thresh/DIRTY_SCOPE     limit
> + *
> + * (o) pseudo code
> + *
> + *     pos_ratio = 1 << BANDWIDTH_CALC_SHIFT
> + *
> + *     if (dirty < thresh) scale up   pos_ratio
> + *     if (dirty > thresh) scale down pos_ratio
> + *
> + *     if (bdi_dirty < bdi_thresh) scale up   pos_ratio
> + *     if (bdi_dirty > bdi_thresh) scale down pos_ratio
> + *
> + * (o) global/bdi control lines
> + *
> + * Based on the number of dirty pages (the X), pos_ratio (the Y) is scaled by
> + * several control lines in turn.
> + *
> + * The control lines for the global/bdi setpoints both stretch up to @limit.
> + * If any control line drops below Y=0 before reaching @limit, an auxiliary
> + * line will be setup to connect them. The below figure illustrates the main
> + * bdi control line with an auxiliary line extending it to @limit.
> + *
> + * This allows smoothly throttling bdi_dirty down to normal if it starts high
> + * in situations like
> + * - start writing to a slow SD card and a fast disk at the same time. The SD
> + *   card's bdi_dirty may rush to 5 times higher than bdi setpoint.
> + * - the bdi dirty thresh goes down quickly due to change of JBOD workload
> + *
> + *   o
> + *     o
> + *       o                                      [o] main control line
> + *         o                                    [*] auxiliary control line
> + *           o
> + *             o
> + *               o
> + *                 o
> + *                   o
> + *                     o
> + *                       o--------------------- balance point, bw scale = 1
> + *                       | o
> + *                       |   o
> + *                       |     o
> + *                       |       o
> + *                       |         o
> + *                       |           o
> + *                       |             o------- connect point, bw scale = 1/2
> + *                       |               .*
> + *                       |                 .   *
> + *                       |                   .      *
> + *                       |                     .         *
> + *                       |                       .           *
> + *                       |                         .              *
> + *                       |                           .                 *
> + *  [--------------------+-----------------------------.--------------------*]
> + *  0                 bdi setpoint                 bdi origin            limit
> + *
> + * The bdi control line: if (origin < limit), an auxiliary control line (*)
> + * will be setup to extend the main control line (o) to @limit.
> + */
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +					unsigned long thresh,
> +					unsigned long dirty,
> +					unsigned long bdi_thresh,
> +					unsigned long bdi_dirty)
> +{
> +	unsigned long limit = hard_dirty_limit(thresh);
> +	unsigned long origin;
> +	unsigned long goal;
> +	unsigned long long span;
> +	unsigned long long pos_ratio;	/* for scaling up/down the rate limit */
> +
> +	if (unlikely(dirty >= limit))
> +		return 0;
> +
> +	/*
> +	 * global setpoint
> +	 */
> +	goal = thresh - thresh / DIRTY_SCOPE;
> +	origin = 4 * thresh;
> +
> +	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> +		origin = limit;			/* auxiliary control line */
> +		goal = (goal + origin) / 2;
> +		pos_ratio >>= 1;
> +	}
> +	pos_ratio = origin - dirty;
> +	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> +	do_div(pos_ratio, origin - goal + 1);
> +
> +	/*
> +	 * bdi setpoint
> +	 */
> +	if (unlikely(bdi_thresh > thresh))
> +		bdi_thresh = thresh;
> +	goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
> +	/*
> +	 * Use span=(4*bw) in single disk case and transit to bdi_thresh in
> +	 * JBOD case.  For JBOD, bdi_thresh could fluctuate up to its own size.
> +	 * Otherwise the bdi write bandwidth is good for limiting the floating
> +	 * area, which makes the bdi control line a good backup when the global
> +	 * control line is too flat/weak in large memory systems.
> +	 */
> +	span = (u64) bdi_thresh * (thresh - bdi_thresh) +
> +		(4 * bdi->avg_write_bandwidth) * bdi_thresh;
> +	do_div(span, thresh + 1);
> +	origin = goal + 2 * span;
> +
> +	if (unlikely(bdi_dirty > goal + span)) {
> +		if (bdi_dirty > limit)
> +			return 0;
> +		if (origin < limit) {
> +			origin = limit;		/* auxiliary control line */
> +			goal += span;
> +			pos_ratio >>= 1;
> +		}
> +	}
> +	pos_ratio *= origin - bdi_dirty;
> +	do_div(pos_ratio, origin - goal + 1);
> +
> +	return pos_ratio;
> +}
> +
>  static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
>  				       unsigned long elapsed,
>  				       unsigned long written)
> 
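
The span interpolation in the quoted bdi setpoint block can be checked with a small user-space sketch. This is only an illustration under stated assumptions: bdi_span() is a hypothetical helper name that does not appear in the patch, the arguments are plain page counts (and pages/s for the bandwidth), and ordinary 64-bit division stands in for do_div().

```c
#include <stdint.h>

/*
 * Hypothetical user-space helper (not in the patch) mirroring the span
 * computation: a mix of bdi_thresh and 4 * write bandwidth, weighted by
 * how much of the global threshold this bdi owns.
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh,
			 uint64_t avg_write_bandwidth)
{
	uint64_t span;

	/* same clamp as the patch: a bdi cannot own more than thresh */
	if (bdi_thresh > thresh)
		bdi_thresh = thresh;

	span = bdi_thresh * (thresh - bdi_thresh) +
	       (4 * avg_write_bandwidth) * bdi_thresh;

	return span / (thresh + 1);	/* do_div() in the kernel */
}
```

With bdi_thresh == thresh (the single disk case) the first term vanishes and the result sits just below 4 * avg_write_bandwidth. With bdi_thresh much smaller than thresh (the JBOD case) the first term dominates and the result moves toward bdi_thresh.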

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:41         ` Peter Zijlstra
@ 2011-08-08 23:05           ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-08 23:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:41:41PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
> >         goal = thresh - thresh / DIRTY_SCOPE;
> >         origin = 4 * thresh;
> >  
> > -       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > -               origin = limit;                 /* auxiliary control line */
> > -               goal = (goal + origin) / 2;
> > -               pos_ratio >>= 1;
> > -       }
> >         pos_ratio = origin - dirty;
> >         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
> >         do_div(pos_ratio, origin - goal + 1); 

FYI, I've updated the fix to the one below, so that @limit is used
as the origin in the rare case of (4*thresh < dirty).

--- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-09 06:34:25.000000000 +0800
@@ -536,13 +536,8 @@ static unsigned long bdi_position_ratio(
 	 * global setpoint
 	 */
 	goal = thresh - thresh / DIRTY_SCOPE;
-	origin = 4 * thresh;
+	origin = max(4 * thresh, limit);
 
-	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
-		origin = limit;			/* auxiliary control line */
-		goal = (goal + origin) / 2;
-		pos_ratio >>= 1;
-	}
 	pos_ratio = origin - dirty;
 	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
 	do_div(pos_ratio, origin - goal + 1);

> So basically, pos_ratio = (4t - d) / (25/8)t, which if I'm not mistaken
> comes out at 32/25 - 8d/25t. Which simply doesn't make sense at all. 

This is the more meaningful view :)

                    origin - dirty
        pos_ratio = --------------
                    origin - goal

which comes from the [*] control line below, so that when (dirty == goal),
pos_ratio == 1.0:

 ^ pos_ratio
 |
 |
 |   *
 |      *
 |         *
 |            *
 |               *
 |                  *
 |                     *
 |                        *
 |                           *
 |                              *
 |                                 *
 .. pos_ratio = 1.0 ..................*
 |                                    .  *
 |                                    .     *
 |                                    .        *
 |                                    .           *
 |                                    .              *
 |                                    .                 *
 |                                    .                    *
 |                                    .                       *
 |                                    .                          *
 |                                    .                             *
 |                                    .                                *
 |                                    .                                   *
 |                                    .                                      *
 |                                    .                                         *
 |                                    .                                            *
 |                                    .                                               *
 +------------------------------------.--------------------------------------------------*---------------------->
 0                                   goal                                              origin         dirty pages
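
The control line above can be written out as a user-space sketch. Assumptions are flagged here and in the comments: global_pos_ratio() is a hypothetical name, DIRTY_SCOPE is taken to be 8 (the value implied by the (25/8)t denominator quoted above), and plain 64-bit division replaces do_div(). The origin uses the max(4 * thresh, limit) form from the updated fix.

```c
#include <stdint.h>

#define BANDWIDTH_CALC_SHIFT	10	/* from the patch: 1.0 == 1 << 10 */
#define DIRTY_SCOPE		8	/* assumed value, implied by (25/8)t */

/*
 * Hypothetical user-space rendering of the global control line.
 * Returns pos_ratio in fixed point, or 0 once dirty reaches limit.
 */
static uint64_t global_pos_ratio(uint64_t thresh, uint64_t limit,
				 uint64_t dirty)
{
	uint64_t goal, origin, pos_ratio;

	if (dirty >= limit)
		return 0;

	goal = thresh - thresh / DIRTY_SCOPE;
	origin = 4 * thresh > limit ? 4 * thresh : limit;

	pos_ratio = origin - dirty;	/* safe: dirty < limit <= origin */
	pos_ratio <<= BANDWIDTH_CALC_SHIFT;

	return pos_ratio / (origin - goal + 1);	/* do_div() in the kernel */
}
```

Because of the "+ 1" in the divisor, the value at (dirty == goal) lands one step below 1 << BANDWIDTH_CALC_SHIFT rather than exactly on it.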

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:31         ` Peter Zijlstra
@ 2011-08-08 22:47           ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-08 22:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 10:31:49PM +0800, Peter Zijlstra wrote:
> On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> > It's actually dead code because (origin < limit) should never happen.
> > I feel so good being able to drop 5 more lines of code :) 
> 
> OK, but that leaves me trying to figure out what origin is, and why it's
> 4 * thresh.

origin is where the control line crosses the X axis (in both the
global/bdi setpoint cases).
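
To make the role of origin concrete, here is a rough user-space sketch of a control line with a configurable origin. It is not code from the patch: line_pos_ratio() and its origin_mult parameter are hypothetical, DIRTY_SCOPE is assumed to be 8, and the caller is expected to keep dirty at or below the origin.

```c
#include <stdint.h>

#define BANDWIDTH_CALC_SHIFT	10	/* from the patch */
#define DIRTY_SCOPE		8	/* assumed value */

/*
 * Hypothetical helper: fixed-point pos_ratio for a control line that
 * crosses the X axis at origin_mult * thresh.
 */
static uint64_t line_pos_ratio(uint64_t thresh, unsigned int origin_mult,
			       uint64_t dirty)
{
	uint64_t goal = thresh - thresh / DIRTY_SCOPE;
	uint64_t origin = (uint64_t)origin_mult * thresh;

	/* the line reaches 0 exactly when dirty == origin */
	return ((origin - dirty) << BANDWIDTH_CALC_SHIFT) /
	       (origin - goal + 1);
}
```

Moving the origin closer to the goal shortens (origin - goal), so the same excess of dirty pages over the goal costs more pos_ratio: with thresh = 8000 and dirty = 8000, an origin of 2 * thresh yields a smaller ratio than an origin of 4 * thresh.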

"4 * thresh" is merely something larger than max(dirty, thresh)
that yields a reasonably gentle slope. The steeper the slope, the stronger the
"gravity" to bring the dirty pages back to the setpoint.

> I'm having a horrible time understanding this stuff.

Sorry for that. Do you have more questions?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:11       ` Wu Fengguang
  (?)
@ 2011-08-08 14:41         ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-08 14:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> @@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
>         goal = thresh - thresh / DIRTY_SCOPE;
>         origin = 4 * thresh;
>  
> -       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> -               origin = limit;                 /* auxiliary control line */
> -               goal = (goal + origin) / 2;
> -               pos_ratio >>= 1;
> -       }
>         pos_ratio = origin - dirty;
>         pos_ratio <<= BANDWIDTH_CALC_SHIFT;
>         do_div(pos_ratio, origin - goal + 1); 

So basically, pos_ratio = (4t - d) / (25/8)t, which if I'm not mistaken
comes out at 32/25 - 8d/25t. Which simply doesn't make sense at all. 
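
For reference, the algebra can be spelled out with t = thresh and d = dirty. This assumes DIRTY_SCOPE = 8, which is what the (25/8)t denominator implies, and ignores the "+ 1" in the divisor:

```latex
\begin{aligned}
\text{goal} &= t - \tfrac{t}{8} = \tfrac{7}{8}t, \qquad \text{origin} = 4t \\
\text{pos\_ratio} &= \frac{\text{origin} - d}{\text{origin} - \text{goal}}
  = \frac{4t - d}{\tfrac{25}{8}t}
  = \frac{32}{25} - \frac{8d}{25t} \\
d = \tfrac{7}{8}t &\;\Rightarrow\; \text{pos\_ratio} = \tfrac{32}{25} - \tfrac{7}{25} = 1
\end{aligned}
```

So the expression is a straight line in d that takes the value 1 at d = goal and 0 at d = origin.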



^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 14:11       ` Wu Fengguang
  (?)
@ 2011-08-08 14:31         ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-08 14:31 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, 2011-08-08 at 22:11 +0800, Wu Fengguang wrote:
> It's actually dead code because (origin < limit) should never happen.
> I feel so good being able to drop 5 more lines of code :) 

OK, but that leaves me trying to figure out what origin is, and why it's
4 * thresh.

I'm having a horrible time understanding this stuff.

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-08 13:46     ` Peter Zijlstra
@ 2011-08-08 14:11       ` Wu Fengguang
  -1 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-08 14:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Mon, Aug 08, 2011 at 09:46:33PM +0800, Peter Zijlstra wrote:
> On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> > +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> > +                                       unsigned long thresh,
> > +                                       unsigned long dirty,
> > +                                       unsigned long bdi_thresh,
> > +                                       unsigned long bdi_dirty)
> > +{
> > +       unsigned long limit = hard_dirty_limit(thresh);
> > +       unsigned long origin;
> > +       unsigned long goal;
> > +       unsigned long long span;
> > +       unsigned long long pos_ratio;   /* for scaling up/down the rate limit */
> > +
> > +       if (unlikely(dirty >= limit))
> > +               return 0;
> > +
> > +       /*
> > +        * global setpoint
> > +        */
> > +       goal = thresh - thresh / DIRTY_SCOPE;
> > +       origin = 4 * thresh;
> > +
> > +       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> > +               origin = limit;                 /* auxiliary control line */
> > +               goal = (goal + origin) / 2;
> > +               pos_ratio >>= 1; 
> 
> use before init?

Yeah, it's embarrassing: I found that this bug goes all the way back to the initial version...

It's actually dead code because (origin < limit) should never happen.
I feel so good being able to drop 5 more lines of code :)

Thanks,
Fengguang
---

--- linux-next.orig/mm/page-writeback.c	2011-08-08 21:56:11.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-08 22:04:48.000000000 +0800
@@ -538,11 +538,6 @@ static unsigned long bdi_position_ratio(
 	goal = thresh - thresh / DIRTY_SCOPE;
 	origin = 4 * thresh;
 
-	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
-		origin = limit;			/* auxiliary control line */
-		goal = (goal + origin) / 2;
-		pos_ratio >>= 1;
-	}
 	pos_ratio = origin - dirty;
 	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
 	do_div(pos_ratio, origin - goal + 1);

^ permalink raw reply	[flat|nested] 169+ messages in thread

* Re: [PATCH 2/5] writeback: dirty position control
  2011-08-06  8:44   ` Wu Fengguang
  (?)
@ 2011-08-08 13:46     ` Peter Zijlstra
  -1 siblings, 0 replies; 169+ messages in thread
From: Peter Zijlstra @ 2011-08-08 13:46 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: linux-fsdevel, Andrew Morton, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

On Sat, 2011-08-06 at 16:44 +0800, Wu Fengguang wrote:
> +static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> +                                       unsigned long thresh,
> +                                       unsigned long dirty,
> +                                       unsigned long bdi_thresh,
> +                                       unsigned long bdi_dirty)
> +{
> +       unsigned long limit = hard_dirty_limit(thresh);
> +       unsigned long origin;
> +       unsigned long goal;
> +       unsigned long long span;
> +       unsigned long long pos_ratio;   /* for scaling up/down the rate limit */
> +
> +       if (unlikely(dirty >= limit))
> +               return 0;
> +
> +       /*
> +        * global setpoint
> +        */
> +       goal = thresh - thresh / DIRTY_SCOPE;
> +       origin = 4 * thresh;
> +
> +       if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
> +               origin = limit;                 /* auxiliary control line */
> +               goal = (goal + origin) / 2;
> +               pos_ratio >>= 1; 

use before init?

^ permalink raw reply	[flat|nested] 169+ messages in thread

* [PATCH 2/5] writeback: dirty position control
  2011-08-06  8:44 [PATCH 0/5] IO-less dirty throttling v8 Wu Fengguang
  2011-08-06  8:44   ` Wu Fengguang
@ 2011-08-06  8:44   ` Wu Fengguang
  0 siblings, 0 replies; 169+ messages in thread
From: Wu Fengguang @ 2011-08-06  8:44 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig,
	Dave Chinner, Greg Thelen, Minchan Kim, Vivek Goyal,
	Andrea Righi, linux-mm, LKML

[-- Attachment #1: writeback-control-algorithms.patch --]
[-- Type: text/plain, Size: 7230 bytes --]

Old scheme is,
                                          |
                           free run area  |  throttle area
  ----------------------------------------+---------------------------->
                                    thresh^                  dirty pages

New scheme is,

  ^ task rate limit
  |
  |            *
  |             *
  |              *
  |[free run]      *      [smooth throttled]
  |                  *
  |                     *
  |                         *
  ..bdi->dirty_ratelimit..........*
  |                               .     *
  |                               .          *
  |                               .              *
  |                               .                 *
  |                               .                    *
  +-------------------------------.-----------------------*------------>
                          setpoint^                  limit^  dirty pages

For simplicity, only the global/bdi setpoint control lines are
implemented here, so the [*] curve is straighter than the ideal one
shown in the figure above.

bdi_position_ratio() provides a scale factor to bdi->dirty_ratelimit, so
that the resulting task rate limit can drive the dirty pages back to the
global/bdi setpoints.
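
As a rough illustration of that scaling (not code from this patch): scale_ratelimit() below is a hypothetical helper that applies a fixed-point pos_ratio, where 1 << BANDWIDTH_CALC_SHIFT means 1.0, to a base rate limit given in pages/s.

```c
#include <stdint.h>

#define BANDWIDTH_CALC_SHIFT	10	/* from the patch */

/*
 * Hypothetical helper, not part of this patch: scale a base rate limit
 * by pos_ratio. The multiply is done in 64 bits to avoid overflow on
 * 32-bit unsigned long.
 */
static unsigned long scale_ratelimit(unsigned long dirty_ratelimit,
				     unsigned long pos_ratio)
{
	return (unsigned long)(((uint64_t)dirty_ratelimit * pos_ratio) >>
			       BANDWIDTH_CALC_SHIFT);
}
```

A pos_ratio above 1024 lets tasks dirty pages faster than the base rate while the dirty count is below the setpoint, and a pos_ratio below 1024 slows them down once it is above.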

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |  143 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-08-06 10:31:32.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-08-06 11:17:07.000000000 +0800
@@ -46,6 +46,8 @@
  */
 #define BANDWIDTH_INTERVAL	max(HZ/5, 1)
 
+#define BANDWIDTH_CALC_SHIFT	10
+
 /*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
@@ -495,6 +497,147 @@ unsigned long bdi_dirty_limit(struct bac
 	return bdi_dirty;
 }
 
+/*
+ * Dirty position control.
+ *
+ * (o) global/bdi setpoints
+ *
+ *  When the number of dirty pages goes higher/lower than the setpoint, the dirty
+ *  position ratio (and hence dirty rate limit) will be decreased/increased to
+ *  bring the dirty pages back to the setpoint.
+ *
+ *                              setpoint
+ *                                 v
+ * |-------------------------------*-------------------------------|-----------|
+ * ^                               ^                               ^           ^
+ * (thresh + background_thresh)/2  thresh - thresh/DIRTY_SCOPE     thresh  limit
+ *
+ *                          bdi setpoint
+ *                                 v
+ * |-------------------------------*-------------------------------------------|
+ * ^                               ^                                           ^
+ * 0                               bdi_thresh - bdi_thresh/DIRTY_SCOPE     limit
+ *
+ * (o) pseudo code
+ *
+ *     pos_ratio = 1 << BANDWIDTH_CALC_SHIFT
+ *
+ *     if (dirty < setpoint) scale up   pos_ratio
+ *     if (dirty > setpoint) scale down pos_ratio
+ *
+ *     if (bdi_dirty < bdi_setpoint) scale up   pos_ratio
+ *     if (bdi_dirty > bdi_setpoint) scale down pos_ratio
+ *
+ * (o) global/bdi control lines
+ *
+ * Based on the number of dirty pages (the X), pos_ratio (the Y) is scaled by
+ * several control lines in turn.
+ *
+ * The control lines for the global/bdi setpoints both stretch up to @limit.
+ * If any control line drops below Y=0 before reaching @limit, an auxiliary
+ * line will be set up to connect them. The figure below illustrates the main
+ * bdi control line with an auxiliary line extending it to @limit.
+ *
+ * This allows smoothly throttling bdi_dirty down to normal if it starts high
+ * in situations like
+ * - start writing to a slow SD card and a fast disk at the same time. The SD
+ *   card's bdi_dirty may rush to 5 times higher than bdi setpoint.
+ * - the bdi dirty thresh goes down quickly due to change of JBOD workload
+ *
+ *   o
+ *     o
+ *       o                                      [o] main control line
+ *         o                                    [*] auxiliary control line
+ *           o
+ *             o
+ *               o
+ *                 o
+ *                   o
+ *                     o
+ *                       o--------------------- balance point, bw scale = 1
+ *                       | o
+ *                       |   o
+ *                       |     o
+ *                       |       o
+ *                       |         o
+ *                       |           o
+ *                       |             o------- connect point, bw scale = 1/2
+ *                       |               .*
+ *                       |                 .   *
+ *                       |                   .      *
+ *                       |                     .         *
+ *                       |                       .           *
+ *                       |                         .              *
+ *                       |                           .                 *
+ *  [--------------------+-----------------------------.--------------------*]
+ *  0                 bdi setpoint                 bdi origin            limit
+ *
+ * The bdi control line: if (origin < limit), an auxiliary control line (*)
+ * will be set up to extend the main control line (o) to @limit.
+ */
+static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
+					unsigned long thresh,
+					unsigned long dirty,
+					unsigned long bdi_thresh,
+					unsigned long bdi_dirty)
+{
+	unsigned long limit = hard_dirty_limit(thresh);
+	unsigned long origin;
+	unsigned long goal;
+	unsigned long long span;
+	unsigned long long pos_ratio;	/* for scaling up/down the rate limit */
+
+	if (unlikely(dirty >= limit))
+		return 0;
+
+	/*
+	 * global setpoint
+	 */
+	goal = thresh - thresh / DIRTY_SCOPE;
+	origin = 4 * thresh;
+
+	if (unlikely(origin < limit && dirty > (goal + origin) / 2)) {
+		origin = limit;			/* auxiliary control line */
+		goal = (goal + origin) / 2;
+		pos_ratio >>= 1;
+	}
+	pos_ratio = origin - dirty;
+	pos_ratio <<= BANDWIDTH_CALC_SHIFT;
+	do_div(pos_ratio, origin - goal + 1);
+
+	/*
+	 * bdi setpoint
+	 */
+	if (unlikely(bdi_thresh > thresh))
+		bdi_thresh = thresh;
+	goal = bdi_thresh - bdi_thresh / DIRTY_SCOPE;
+	/*
+	 * Use span=(4*bw) in single disk case and transition to bdi_thresh in
+	 * JBOD case.  For JBOD, bdi_thresh could fluctuate up to its own size.
+	 * Otherwise the bdi write bandwidth is good for limiting the floating
+	 * area, which makes the bdi control line a good backup when the global
+	 * control line is too flat/weak in large memory systems.
+	 */
+	span = (u64) bdi_thresh * (thresh - bdi_thresh) +
+		(4 * bdi->avg_write_bandwidth) * bdi_thresh;
+	do_div(span, thresh + 1);
+	origin = goal + 2 * span;
+
+	if (unlikely(bdi_dirty > goal + span)) {
+		if (bdi_dirty > limit)
+			return 0;
+		if (origin < limit) {
+			origin = limit;		/* auxiliary control line */
+			goal += span;
+			pos_ratio >>= 1;
+		}
+	}
+	pos_ratio *= origin - bdi_dirty;
+	do_div(pos_ratio, origin - goal + 1);
+
+	return pos_ratio;
+}
+
 static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
 				       unsigned long elapsed,
 				       unsigned long written)
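
For readers following the arithmetic, here is a minimal user-space sketch
of the two formulas the patch relies on: the straight control line
through (goal, 1.0) and (origin, 0), and the bdi span.  It is not the
kernel code; control_line() and bdi_span() are hypothetical names, plain
64-bit division stands in for do_div(), and the sketch assumes
goal <= origin, dirty <= origin and bdi_thresh <= thresh (the patch
clamps bdi_thresh to thresh before computing the span).

```c
#include <stdint.h>

#define BANDWIDTH_CALC_SHIFT	10	/* same fixed-point shift as the patch */

/*
 * Control line: about 1.0 (in fixed point) at dirty == goal, falling
 * linearly to 0 at dirty == origin.  The "+ 1" in the divisor guards
 * against division by zero, as in the patch.
 */
static uint64_t control_line(uint64_t origin, uint64_t goal, uint64_t dirty)
{
	uint64_t ratio = origin - dirty;

	ratio <<= BANDWIDTH_CALC_SHIFT;
	return ratio / (origin - goal + 1);
}

/*
 * bdi span: close to 4 * bw when bdi_thresh == thresh (single disk),
 * close to bdi_thresh when bdi_thresh is much smaller than thresh (JBOD).
 */
static uint64_t bdi_span(uint64_t thresh, uint64_t bdi_thresh, uint64_t bw)
{
	uint64_t span = bdi_thresh * (thresh - bdi_thresh) +
			4 * bw * bdi_thresh;

	return span / (thresh + 1);
}
```

With made-up values origin = 4000 and goal = 875, control_line() returns
1023 at dirty == goal (just under 1 << 10 because of the "+ 1") and 0 at
dirty == origin.  With bw = 100, bdi_span() returns 399 for
thresh == bdi_thresh == 1000 and 993 for thresh = 100000,
bdi_thresh = 1000.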



^ permalink raw reply	[flat|nested] 169+ messages in thread

end of thread, other threads:[~2011-09-06 12:40 UTC | newest]

Thread overview: 169+ messages
     [not found] <CAFdhcLRKvfqBnXCXLwq-Qe1eNAGC-8XJ3BtHpQKzaa3RhHyp6A@mail.gmail.com>
2011-08-17  6:40 ` [PATCH 2/5] writeback: dirty position control David Horner
2011-08-17 12:03   ` Jan Kara
2011-08-17 12:35     ` Wu Fengguang
2011-08-16  2:20 [PATCH 0/5] IO-less dirty throttling v9 Wu Fengguang
2011-08-16  2:20 ` [PATCH 2/5] writeback: dirty position control Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16  2:20   ` Wu Fengguang
2011-08-16 19:41   ` Jan Kara
2011-08-16 19:41     ` Jan Kara
2011-08-17 13:23     ` Wu Fengguang
2011-08-17 13:49       ` Wu Fengguang
2011-08-17 13:49         ` Wu Fengguang
2011-08-17 20:24       ` Jan Kara
2011-08-17 20:24         ` Jan Kara
2011-08-18  4:18         ` Wu Fengguang
2011-08-18  4:18           ` Wu Fengguang
2011-08-18  4:41           ` Wu Fengguang
2011-08-18  4:41             ` Wu Fengguang
2011-08-18 19:16           ` Jan Kara
2011-08-18 19:16             ` Jan Kara
2011-08-24  3:16         ` Wu Fengguang
2011-08-24  3:16           ` Wu Fengguang
2011-08-19  2:53   ` Vivek Goyal
2011-08-19  2:53     ` Vivek Goyal
2011-08-19  3:25     ` Wu Fengguang
2011-08-19  3:25       ` Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2011-08-06  8:44 [PATCH 0/5] IO-less dirty throttling v8 Wu Fengguang
2011-08-06  8:44 ` [PATCH 2/5] writeback: dirty position control Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-06  8:44   ` Wu Fengguang
2011-08-08 13:46   ` Peter Zijlstra
2011-08-08 13:46     ` Peter Zijlstra
2011-08-08 13:46     ` Peter Zijlstra
2011-08-08 14:11     ` Wu Fengguang
2011-08-08 14:11       ` Wu Fengguang
2011-08-08 14:31       ` Peter Zijlstra
2011-08-08 14:31         ` Peter Zijlstra
2011-08-08 14:31         ` Peter Zijlstra
2011-08-08 22:47         ` Wu Fengguang
2011-08-08 22:47           ` Wu Fengguang
2011-08-09  9:31           ` Peter Zijlstra
2011-08-09  9:31             ` Peter Zijlstra
2011-08-09  9:31             ` Peter Zijlstra
2011-08-10 12:28             ` Wu Fengguang
2011-08-10 12:28               ` Wu Fengguang
2011-08-08 14:41       ` Peter Zijlstra
2011-08-08 14:41         ` Peter Zijlstra
2011-08-08 14:41         ` Peter Zijlstra
2011-08-08 23:05         ` Wu Fengguang
2011-08-08 23:05           ` Wu Fengguang
2011-08-09 10:32           ` Peter Zijlstra
2011-08-09 10:32             ` Peter Zijlstra
2011-08-09 10:32             ` Peter Zijlstra
2011-08-09 17:20           ` Peter Zijlstra
2011-08-09 17:20             ` Peter Zijlstra
2011-08-09 17:20             ` Peter Zijlstra
2011-08-10 22:34             ` Jan Kara
2011-08-10 22:34               ` Jan Kara
2011-08-11  2:29               ` Wu Fengguang
2011-08-11  2:29                 ` Wu Fengguang
2011-08-11 11:14                 ` Jan Kara
2011-08-11 11:14                   ` Jan Kara
2011-08-16  8:35                   ` Wu Fengguang
2011-08-16  8:35                     ` Wu Fengguang
2011-08-12 13:19             ` Wu Fengguang
2011-08-12 13:19               ` Wu Fengguang
2011-08-10 21:40           ` Vivek Goyal
2011-08-10 21:40             ` Vivek Goyal
2011-08-16  8:55             ` Wu Fengguang
2011-08-16  8:55               ` Wu Fengguang
2011-08-11 22:56           ` Peter Zijlstra
2011-08-11 22:56             ` Peter Zijlstra
2011-08-11 22:56             ` Peter Zijlstra
2011-08-12  2:43             ` Wu Fengguang
2011-08-12  2:43               ` Wu Fengguang
2011-08-12  3:18               ` Wu Fengguang
2011-08-12  5:45               ` Wu Fengguang
2011-08-12  5:45                 ` Wu Fengguang
2011-08-12  9:45                 ` Peter Zijlstra
2011-08-12  9:45                   ` Peter Zijlstra
2011-08-12  9:45                   ` Peter Zijlstra
2011-08-12 11:07                   ` Wu Fengguang
2011-08-12 11:07                     ` Wu Fengguang
2011-08-12 12:17                     ` Peter Zijlstra
2011-08-12 12:17                       ` Peter Zijlstra
2011-08-12 12:17                       ` Peter Zijlstra
2011-08-12  9:47               ` Peter Zijlstra
2011-08-12  9:47                 ` Peter Zijlstra
2011-08-12  9:47                 ` Peter Zijlstra
2011-08-12 11:11                 ` Wu Fengguang
2011-08-12 11:11                   ` Wu Fengguang
2011-08-12 12:54           ` Peter Zijlstra
2011-08-12 12:54             ` Peter Zijlstra
2011-08-12 12:54             ` Peter Zijlstra
2011-08-12 12:59             ` Wu Fengguang
2011-08-12 12:59               ` Wu Fengguang
2011-08-12 13:08               ` Peter Zijlstra
2011-08-12 13:08                 ` Peter Zijlstra
2011-08-12 13:08                 ` Peter Zijlstra
2011-08-12 13:04           ` Peter Zijlstra
2011-08-12 13:04             ` Peter Zijlstra
2011-08-12 13:04             ` Peter Zijlstra
2011-08-12 14:20             ` Wu Fengguang
2011-08-12 14:20               ` Wu Fengguang
2011-08-22 15:38               ` Peter Zijlstra
2011-08-22 15:38                 ` Peter Zijlstra
2011-08-22 15:38                 ` Peter Zijlstra
2011-08-23  3:40                 ` Wu Fengguang
2011-08-23  3:40                   ` Wu Fengguang
2011-08-23 10:01                   ` Peter Zijlstra
2011-08-23 10:01                     ` Peter Zijlstra
2011-08-23 10:01                     ` Peter Zijlstra
2011-08-23 14:15                     ` Wu Fengguang
2011-08-23 14:15                       ` Wu Fengguang
2011-08-23 17:47                       ` Vivek Goyal
2011-08-23 17:47                         ` Vivek Goyal
2011-08-24  0:12                         ` Wu Fengguang
2011-08-24  0:12                           ` Wu Fengguang
2011-08-24 16:12                           ` Peter Zijlstra
2011-08-24 16:12                             ` Peter Zijlstra
2011-08-26  0:18                             ` Wu Fengguang
2011-08-26  0:18                               ` Wu Fengguang
2011-08-26  9:04                               ` Peter Zijlstra
2011-08-26  9:04                                 ` Peter Zijlstra
2011-08-26 10:04                                 ` Wu Fengguang
2011-08-26 10:04                                   ` Wu Fengguang
2011-08-26 10:42                                   ` Peter Zijlstra
2011-08-26 10:42                                     ` Peter Zijlstra
2011-08-26 10:52                                     ` Wu Fengguang
2011-08-26 10:52                                       ` Wu Fengguang
2011-08-26 11:26                                   ` Wu Fengguang
2011-08-26 12:11                                     ` Peter Zijlstra
2011-08-26 12:11                                       ` Peter Zijlstra
2011-08-26 12:20                                       ` Wu Fengguang
2011-08-26 12:20                                         ` Wu Fengguang
2011-08-26 13:13                                         ` Wu Fengguang
2011-08-26 13:18                                           ` Peter Zijlstra
2011-08-26 13:18                                             ` Peter Zijlstra
2011-08-26 13:24                                             ` Wu Fengguang
2011-08-26 13:24                                               ` Wu Fengguang
2011-08-24 18:00                           ` Vivek Goyal
2011-08-24 18:00                             ` Vivek Goyal
2011-08-25  3:19                             ` Wu Fengguang
2011-08-25  3:19                               ` Wu Fengguang
2011-08-25 22:20                               ` Vivek Goyal
2011-08-25 22:20                                 ` Vivek Goyal
2011-08-26  1:56                                 ` Wu Fengguang
2011-08-26  1:56                                   ` Wu Fengguang
2011-08-26  8:56                                   ` Peter Zijlstra
2011-08-26  8:56                                     ` Peter Zijlstra
2011-08-26  9:53                                     ` Wu Fengguang
2011-08-26  9:53                                       ` Wu Fengguang
2011-08-29 13:12                             ` Peter Zijlstra
2011-08-29 13:12                               ` Peter Zijlstra
2011-08-29 13:37                               ` Wu Fengguang
2011-08-29 13:37                                 ` Wu Fengguang
2011-09-02 12:16                                 ` Peter Zijlstra
2011-09-02 12:16                                   ` Peter Zijlstra
2011-09-06 12:40                                 ` Peter Zijlstra
2011-09-06 12:40                                   ` Peter Zijlstra
2011-08-24 15:57                       ` Peter Zijlstra
2011-08-24 15:57                         ` Peter Zijlstra
2011-08-24 15:57                         ` Peter Zijlstra
2011-08-25  5:30                         ` Wu Fengguang
2011-08-25  5:30                           ` Wu Fengguang
2011-08-23 14:36                     ` Vivek Goyal
2011-08-23 14:36                       ` Vivek Goyal
2011-08-09  2:08   ` Vivek Goyal
2011-08-09  2:08     ` Vivek Goyal
2011-08-16  8:59     ` Wu Fengguang
2011-08-16  8:59       ` Wu Fengguang
