* Likely stupid question on "throttle_vm_writeout"
@ 2009-11-09 15:15 Martin Knoblauch
  2009-11-09 15:26 ` Peter Zijlstra
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Knoblauch @ 2009-11-09 15:15 UTC (permalink / raw)
  To: linux-kernel; +Cc: Peter Zijlstra, Fengguang Wu

Hi, (please CC me on replies)

 I have a likely stupid question on the function "throttle_vm_writeout". Looking at the code I find:

                if (global_page_state(NR_UNSTABLE_NFS) +
                        global_page_state(NR_WRITEBACK) <= dirty_thresh)
                                break;
                congestion_wait(WRITE, HZ/10);

Shouldn't the NR_FILE_DIRTY pages be considered as well?
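
For reference, the surrounding loop in mm/page-writeback.c looks roughly
like this around 2.6.31 (a sketch from memory of that era's source, not
an exact quote):

void throttle_vm_writeout(gfp_t gfp_mask)
{
        unsigned long background_thresh;
        unsigned long dirty_thresh;

        for ( ; ; ) {
                get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);

                /*
                 * Boost the allowable dirty threshold a bit for page
                 * allocators so they don't get DoS'ed by heavy writers.
                 */
                dirty_thresh += dirty_thresh / 10;      /* wheeee... */

                if (global_page_state(NR_UNSTABLE_NFS) +
                        global_page_state(NR_WRITEBACK) <= dirty_thresh)
                                break;
                congestion_wait(WRITE, HZ/10);

                /*
                 * The caller might hold locks which can prevent IO
                 * completion or progress in the filesystem. So we
                 * cannot just sit here waiting for IO to complete.
                 */
                if ((gfp_mask & (__GFP_FS|__GFP_IO)) != (__GFP_FS|__GFP_IO))
                        break;
        }
}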

Cheers

Martin 
------------------------------------------------------
Martin Knoblauch
email: k n o b i AT knobisoft DOT de
www:   http://www.knobisoft.de



* Re: Likely stupid question on "throttle_vm_writeout"
  2009-11-09 15:15 Likely stupid question on "throttle_vm_writeout" Martin Knoblauch
@ 2009-11-09 15:26 ` Peter Zijlstra
  2009-11-10  2:08   ` Wu Fengguang
  0 siblings, 1 reply; 8+ messages in thread
From: Peter Zijlstra @ 2009-11-09 15:26 UTC (permalink / raw)
  To: Martin Knoblauch; +Cc: linux-kernel, Fengguang Wu

On Mon, 2009-11-09 at 07:15 -0800, Martin Knoblauch wrote:
> Hi, (please CC me on replies)
> 
>  I have a likely stupid question on the function "throttle_vm_writeout". Looking at the code I find:
> 
>                 if (global_page_state(NR_UNSTABLE_NFS) +
>                         global_page_state(NR_WRITEBACK) <= dirty_thresh)
>                                 break;
>                 congestion_wait(WRITE, HZ/10);
> 
> Shouldn't the NR_FILE_DIRTY pages be considered as well?

Ha, you just trod onto a piece of ugly I'd totally forgotten about ;-)

The intent of throttle_vm_writeout() is to limit the total number of
pages under writeout and to wait for them to go away.

Everybody hates the function, but nobody has managed to come up with
anything better.



* Re: Likely stupid question on "throttle_vm_writeout"
  2009-11-09 15:26 ` Peter Zijlstra
@ 2009-11-10  2:08   ` Wu Fengguang
  2009-11-10  6:55     ` KOSAKI Motohiro
  2009-11-10 12:01     ` Martin Knoblauch
  0 siblings, 2 replies; 8+ messages in thread
From: Wu Fengguang @ 2009-11-10  2:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Martin Knoblauch, linux-kernel

On Mon, Nov 09, 2009 at 04:26:33PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-11-09 at 07:15 -0800, Martin Knoblauch wrote:
> > Hi, (please CC me on replies)
> > 
> >  I have a likely stupid question on the function "throttle_vm_writeout". Looking at the code I find:
> > 
> >                 if (global_page_state(NR_UNSTABLE_NFS) +
> >                         global_page_state(NR_WRITEBACK) <= dirty_thresh)
> >                                 break;
> >                 congestion_wait(WRITE, HZ/10);
> > 
> > Shouldn't the NR_FILE_DIRTY pages be considered as well?
> 
> Ha, you just trod onto a piece of ugly I'd totally forgotten about ;-)
> 
> The intent of throttle_vm_writeout() is to limit the total number of
> pages under writeout and to wait for them to go away.

Like this:

        vmscan fast => large NR_WRITEBACK => throttle vmscan based on it

> Everybody hates the function, but nobody has managed to come up with
> anything better.

btw, here is another reason to limit NR_WRITEBACK: I saw many
throttle_vm_writeout() waits when there is no wait queue to limit
NR_WRITEBACK (e.g. NFS). In that case the (steadily) large NR_WRITEBACK
is _not_ caused by fast vmscan...

Thanks,
Fengguang


* Re: Likely stupid question on "throttle_vm_writeout"
  2009-11-10  2:08   ` Wu Fengguang
@ 2009-11-10  6:55     ` KOSAKI Motohiro
  2009-11-10 12:01     ` Martin Knoblauch
  1 sibling, 0 replies; 8+ messages in thread
From: KOSAKI Motohiro @ 2009-11-10  6:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: kosaki.motohiro, Peter Zijlstra, Martin Knoblauch, linux-kernel

> On Mon, Nov 09, 2009 at 04:26:33PM +0100, Peter Zijlstra wrote:
> > On Mon, 2009-11-09 at 07:15 -0800, Martin Knoblauch wrote:
> > > Hi, (please CC me on replies)
> > > 
> > >  I have a likely stupid question on the function "throttle_vm_writeout". Looking at the code I find:
> > > 
> > >                 if (global_page_state(NR_UNSTABLE_NFS) +
> > >                         global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > >                                 break;
> > >                 congestion_wait(WRITE, HZ/10);
> > > 
> > > Shouldn't the NR_FILE_DIRTY pages be considered as well?
> > 
> > Ha, you just trod onto a piece of ugly I'd totally forgotten about ;-)
> > 
> > The intent of throttle_vm_writeout() is to limit the total number of
> > pages under writeout and to wait for them to go away.
> 
> Like this:
> 
>         vmscan fast => large NR_WRITEBACK => throttle vmscan based on it
> 
> > Everybody hates the function, but nobody has managed to come up with
> > anything better.
> 
> btw, here is another reason to limit NR_WRITEBACK: I saw many
> throttle_vm_writeout() waits when there is no wait queue to limit
> NR_WRITEBACK (e.g. NFS). In that case the (steadily) large NR_WRITEBACK
> is _not_ caused by fast vmscan...

btw, this explanation points out why we should remove the other bare
congestion_wait() calls in the reclaim path.
At least, I have observed that the congestion_wait() in do_try_to_free_pages()
hurts reclaim performance very much.
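
The call in question sits at the bottom of the priority loop in
do_try_to_free_pages() (mm/vmscan.c); from memory, that era's code looks
roughly like this (a sketch, not an exact quote):

		/* Take a nap, wait for some writeback to complete */
		if (sc->nr_scanned && priority < DEF_PRIORITY - 2)
			congestion_wait(WRITE, HZ/10);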






* Re: Likely stupid question on "throttle_vm_writeout"
  2009-11-10  2:08   ` Wu Fengguang
  2009-11-10  6:55     ` KOSAKI Motohiro
@ 2009-11-10 12:01     ` Martin Knoblauch
  2009-11-10 13:08       ` Wu Fengguang
  1 sibling, 1 reply; 8+ messages in thread
From: Martin Knoblauch @ 2009-11-10 12:01 UTC (permalink / raw)
  To: Wu Fengguang, Peter Zijlstra; +Cc: linux-kernel

----- Original Message ----

> From: Wu Fengguang <fengguang.wu@intel.com>
> To: Peter Zijlstra <peterz@infradead.org>
> Cc: Martin Knoblauch <spamtrap@knobisoft.de>; linux-kernel@vger.kernel.org
> Sent: Tue, November 10, 2009 3:08:58 AM
> Subject: Re: Likely stupid question on "throttle_vm_writeout"
> 
> On Mon, Nov 09, 2009 at 04:26:33PM +0100, Peter Zijlstra wrote:
> > On Mon, 2009-11-09 at 07:15 -0800, Martin Knoblauch wrote:
> > > Hi, (please CC me on replies)
> > > 
> > >  I have a likely stupid question on the function "throttle_vm_writeout".
> > > Looking at the code I find:
> > > 
> > >                 if (global_page_state(NR_UNSTABLE_NFS) +
> > >                         global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > >                                 break;
> > >                 congestion_wait(WRITE, HZ/10);
> > > 
> > > Shouldn't the NR_FILE_DIRTY pages be considered as well?
> > 
> > Ha, you just trod onto a piece of ugly I'd totally forgotten about ;-)
> > 
> > The intent of throttle_vm_writeout() is to limit the total number of
> > pages under writeout and to wait for them to go away.
> 
> Like this:
> 
>         vmscan fast => large NR_WRITEBACK => throttle vmscan based on it
> 
> > Everybody hates the function, but nobody has managed to come up with
> > anything better.
> 
> btw, here is another reason to limit NR_WRITEBACK: I saw many
> throttle_vm_writeout() waits when there is no wait queue to limit
> NR_WRITEBACK (e.g. NFS). In that case the (steadily) large NR_WRITEBACK
> is _not_ caused by fast vmscan...
> 

 That is exactly what made me look at the code again. My observation is that when doing something like:

dd if=/dev/zero of=fast-local-disk bs=1M count=15000

most of the "dirty" pages are in NR_FILE_DIRTY with some relatively small amount (10% or so) in NR_WRITEBACK. If I do:

dd if=/dev/zero of=some-nfs-mount bs=1M count=15000

NR_WRITEBACK almost immediately goes up to dirty_ratio, with NR_UNSTABLE_NFS staying small. Over time NR_UNSTABLE_NFS grows, but it always stays lower than NR_WRITEBACK (maybe a 40/60 split).

 But don't ask what happens if I do both in parallel... The local IO really slows to a crawl, and sometimes the system just becomes very unresponsive. Have we heard that before? :-)

 Somehow I have the impression that NFS writeout is able to absolutely dominate the dirty pages, to the extent that the system becomes unusable.
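
 (For anyone who wants to reproduce this: the three counters can be
watched while the dd runs with something like

watch -n1 'grep -E "^nr_(dirty|writeback|unstable)" /proc/vmstat'

nr_dirty, nr_writeback and nr_unstable being the /proc/vmstat names for
the page states above.)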

Cheers
Martin



* Re: Likely stupid question on "throttle_vm_writeout"
  2009-11-10 12:01     ` Martin Knoblauch
@ 2009-11-10 13:08       ` Wu Fengguang
  2009-11-10 16:11         ` Martin Knoblauch
  0 siblings, 1 reply; 8+ messages in thread
From: Wu Fengguang @ 2009-11-10 13:08 UTC (permalink / raw)
  To: Martin Knoblauch
  Cc: Peter Zijlstra, linux-kernel, Myklebust, Trond, Peter Staubach,
	linux-fsdevel

On Tue, Nov 10, 2009 at 08:01:47PM +0800, Martin Knoblauch wrote:
> ----- Original Message ----
> 
> > From: Wu Fengguang <fengguang.wu@intel.com>
> > To: Peter Zijlstra <peterz@infradead.org>
> > Cc: Martin Knoblauch <spamtrap@knobisoft.de>; linux-kernel@vger.kernel.org
> > Sent: Tue, November 10, 2009 3:08:58 AM
> > Subject: Re: Likely stupid question on "throttle_vm_writeout"
> > 
> > On Mon, Nov 09, 2009 at 04:26:33PM +0100, Peter Zijlstra wrote:
> > > On Mon, 2009-11-09 at 07:15 -0800, Martin Knoblauch wrote:
> > > > Hi, (please CC me on replies)
> > > > 
> > > >  I have a likely stupid question on the function "throttle_vm_writeout".
> > > > Looking at the code I find:
> > > > 
> > > >                 if (global_page_state(NR_UNSTABLE_NFS) +
> > > >                         global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > >                                 break;
> > > >                 congestion_wait(WRITE, HZ/10);
> > > > 
> > > > Shouldn't the NR_FILE_DIRTY pages be considered as well?
> > > 
> > > Ha, you just trod onto a piece of ugly I'd totally forgotten about ;-)
> > > 
> > > The intent of throttle_vm_writeout() is to limit the total number of
> > > pages under writeout and to wait for them to go away.
> > 
> > Like this:
> > 
> >         vmscan fast => large NR_WRITEBACK => throttle vmscan based on it
> > 
> > > Everybody hates the function, but nobody has managed to come up with
> > > anything better.
> > 
> > btw, here is another reason to limit NR_WRITEBACK: I saw many
> > throttle_vm_writeout() waits when there is no wait queue to limit
> > NR_WRITEBACK (e.g. NFS). In that case the (steadily) large NR_WRITEBACK
> > is _not_ caused by fast vmscan...
> > 
> 
>  That is exactly what made me look at the code again. My observation is that when doing something like:
> 
> dd if=/dev/zero of=fast-local-disk bs=1M count=15000
> 
> most of the "dirty" pages are in NR_FILE_DIRTY with some relatively small amount (10% or so) in NR_WRITEBACK. If I do:
> 
> dd if=/dev/zero of=some-nfs-mount bs=1M count=15000
> 
> NR_WRITEBACK almost immediately goes up to dirty_ratio, with
> NR_UNSTABLE_NFS staying small. Over time NR_UNSTABLE_NFS grows, but it
> always stays lower than NR_WRITEBACK (maybe a 40/60 split).

This is interesting, though I don't see explicit NFS code to limit
NR_UNSTABLE_NFS. Maybe there are some implicit rules.

>  But don't ask what happens if I do both in parallel... The local
>  IO really slows to a crawl, and sometimes the system just becomes
>  very unresponsive. Have we heard that before? :-)

You may be the first reporter as far as I can tell :)

>  Somehow I have the impression that NFS writeout is able to
>  absolutely dominate the dirty pages, to the extent that the
>  system becomes unusable.

This is why I want to limit NR_WRITEBACK for NFS:

        [PATCH] NFS: introduce writeback wait queue
        http://lkml.org/lkml/2009/10/3/198

Thanks,
Fengguang


* Re: Likely stupid question on "throttle_vm_writeout"
  2009-11-10 13:08       ` Wu Fengguang
@ 2009-11-10 16:11         ` Martin Knoblauch
  2009-11-11  0:45           ` Wu Fengguang
  0 siblings, 1 reply; 8+ messages in thread
From: Martin Knoblauch @ 2009-11-10 16:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Peter Zijlstra, linux-kernel, Myklebust, Trond, Peter Staubach,
	linux-fsdevel

----- Original Message ----

> From: Wu Fengguang <fengguang.wu@intel.com>
> To: Martin Knoblauch <spamtrap@knobisoft.de>
> Cc: Peter Zijlstra <peterz@infradead.org>; "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>; "Myklebust, Trond" <Trond.Myklebust@netapp.com>; Peter Staubach <staubach@redhat.com>; linux-fsdevel@vger.kernel.org
> Sent: Tue, November 10, 2009 2:08:18 PM
> Subject: Re: Likely stupid question on "throttle_vm_writeout"
> 
> On Tue, Nov 10, 2009 at 08:01:47PM +0800, Martin Knoblauch wrote:
> > ----- Original Message ----
> > 
> > > From: Wu Fengguang 
> > > To: Peter Zijlstra 
> > > Cc: Martin Knoblauch ; linux-kernel@vger.kernel.org
> > > Sent: Tue, November 10, 2009 3:08:58 AM
> > > Subject: Re: Likely stupid question on "throttle_vm_writeout"
> > > 
> > > On Mon, Nov 09, 2009 at 04:26:33PM +0100, Peter Zijlstra wrote:
> > > > On Mon, 2009-11-09 at 07:15 -0800, Martin Knoblauch wrote:
> > > > > Hi, (please CC me on replies)
> > > > > 
> > > > >  I have a likely stupid question on the function "throttle_vm_writeout".
> > > > > Looking at the code I find:
> > > > > 
> > > > >                 if (global_page_state(NR_UNSTABLE_NFS) +
> > > > >                         global_page_state(NR_WRITEBACK) <= dirty_thresh)
> > > > >                                 break;
> > > > >                 congestion_wait(WRITE, HZ/10);
> > > > > 
> > > > > Shouldn't the NR_FILE_DIRTY pages be considered as well?
> > > > 
> > > > Ha, you just trod onto a piece of ugly I'd totally forgotten about ;-)
> > > > 
> > > > The intent of throttle_vm_writeout() is to limit the total number of
> > > > pages under writeout and to wait for them to go away.
> > > 
> > > Like this:
> > > 
> > >         vmscan fast => large NR_WRITEBACK => throttle vmscan based on it
> > > 
> > > > Everybody hates the function, but nobody has managed to come up with
> > > > anything better.
> > > 
> > > btw, here is another reason to limit NR_WRITEBACK: I saw many
> > > throttle_vm_writeout() waits when there is no wait queue to limit
> > > NR_WRITEBACK (e.g. NFS). In that case the (steadily) large NR_WRITEBACK
> > > is _not_ caused by fast vmscan...
> > > 
> > 
> >  That is exactly what made me look at the code again. My observation is
> > that when doing something like:
> > 
> > dd if=/dev/zero of=fast-local-disk bs=1M count=15000
> > 
> > most of the "dirty" pages are in NR_FILE_DIRTY, with a relatively small
> > amount (10% or so) in NR_WRITEBACK. If I do:
> > 
> > dd if=/dev/zero of=some-nfs-mount bs=1M count=15000
> > 
> > NR_WRITEBACK almost immediately goes up to dirty_ratio, with
> > NR_UNSTABLE_NFS staying small. Over time NR_UNSTABLE_NFS grows, but it
> > always stays lower than NR_WRITEBACK (maybe a 40/60 split).
> 
> This is interesting, though I don't see explicit NFS code to limit
> NR_UNSTABLE_NFS. Maybe there are some implicit rules.
> 
> >  But don't ask what happens if I do both in parallel... The local
> >  IO really slows to a crawl, and sometimes the system just becomes
> >  very unresponsive. Have we heard that before? :-)
> 
> You may be the first reporter as far as I can tell :)
>

 Oh come on :-) I (and others) have been reporting bad writeout behaviour for years. But maybe not in this combination of local and NFS I/O.
 
> >  Somehow I have the impression that NFS writeout is able to
> >  absolutely dominate the dirty pages, to the extent that the
> >  system becomes unusable.
> 
> This is why I want to limit NR_WRITEBACK for NFS:
> 
>         [PATCH] NFS: introduce writeback wait queue
>         http://lkml.org/lkml/2009/10/3/198
> 

 Thanks. I will have a look. Is 2.6.32.x OK for testing?

Cheers
Martin



* Re: Likely stupid question on "throttle_vm_writeout"
  2009-11-10 16:11         ` Martin Knoblauch
@ 2009-11-11  0:45           ` Wu Fengguang
  0 siblings, 0 replies; 8+ messages in thread
From: Wu Fengguang @ 2009-11-11  0:45 UTC (permalink / raw)
  To: Martin Knoblauch
  Cc: Peter Zijlstra, linux-kernel, Myklebust, Trond, Peter Staubach,
	linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 592 bytes --]

On Wed, Nov 11, 2009 at 12:11:37AM +0800, Martin Knoblauch wrote:
> > This is why I want to limit NR_WRITEBACK for NFS:
> > 
> >         [PATCH] NFS: introduce writeback wait queue
> >         http://lkml.org/lkml/2009/10/3/198
> > 
> 
>  Thanks. I will have a look. Is 2.6.32.x OK for testing?

I have a more recent patch (attached) based on 2.6.32.

This on/off threshold based approach may not be good for producing a
fluent network flow, so I'm experimenting with more fine-grained waits
(i.e. instead of waiting for 10 jiffies at one time, do ten 1-jiffy
sleeps spread out over time).
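
Roughly along these lines (just a sketch of the idea, not code from the
attached patch; writeback_over_limit() is a made-up name standing in for
the real "still over the threshold" test):

	/*
	 * Sketch: split one HZ/10 congestion wait into ten 1-jiffy
	 * naps, rechecking the limit in between so the writer wakes
	 * up as soon as enough writeback has completed.
	 */
	for (i = 0; i < 10; i++) {
		if (!writeback_over_limit(nfss))
			break;
		io_schedule_timeout(1);
	}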

Thanks,
Fengguang

[-- Attachment #2: writeback-nfs-request-queue.patch --]
[-- Type: text/x-diff, Size: 10820 bytes --]

Subject: NFS: introduce writeback wait queue

The generic writeback routines are departing from congestion_wait()
in preference of get_request_wait(), i.e. waiting on the block request
queues.

Introduce the missing writeback wait queue for NFS, otherwise its
writeback pages may grow out of control.

In particular, balance_dirty_pages() will exit after it pushes
write_chunk pages into the PG_writeback page pool, _OR_ when the
background writeback work quits. The latter is new behavior: the
background work can not only quit (normally) after dropping below the
background threshold, but can also quit when it finds _zero_ dirty pages
to write. That case lets the number of PG_writeback pages grow without
bound if it is not explicitly limited.

CC: Jens Axboe <jens.axboe@oracle.com>
CC: Chris Mason <chris.mason@oracle.com>
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Peter Staubach <staubach@redhat.com>
CC: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---

The wait time and network throughput vary a lot! This is a major problem.
It means nfs_end_page_writeback() is not called smoothly over time,
even when there are plenty of PG_writeback pages on the client side.
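
For concreteness, here is how the patch's thresholds work out, assuming
nfs_congestion_kb = 16384 and 4 KiB pages (illustrative values, not
defaults taken from this patch):

	limit          = nfs_congestion_kb >> (PAGE_SHIFT - 10) = 16384 >> 2 = 4096 pages
	NFS_WAIT_PAGES = 1024 >> (PAGE_CACHE_SHIFT - 10)        = 1024 >> 2  =  256 pages

	ASYNC writers block once nr > limit (4096)
	      and are woken when nr < limit - min(limit/8, 256)   = 3840
	SYNC  writers block once nr > 2*limit (8192)
	      and are woken when nr < 2*limit - min(limit/8, 256) = 7936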

[  397.828509] write_bandwidth: comm=nfsiod pages=192 time=16ms
[  397.850976] write_bandwidth: comm=nfsiod pages=192 time=20ms
[  403.065244] write_bandwidth: comm=nfsiod pages=192 time=5212ms
[  403.549134] write_bandwidth: comm=nfsiod pages=1536 time=144ms
[  403.570717] write_bandwidth: comm=nfsiod pages=192 time=20ms
[  403.595749] write_bandwidth: comm=nfsiod pages=192 time=20ms
[  403.622171] write_bandwidth: comm=nfsiod pages=192 time=24ms
[  403.651779] write_bandwidth: comm=nfsiod pages=192 time=28ms
[  403.680543] write_bandwidth: comm=nfsiod pages=192 time=24ms
[  403.712572] write_bandwidth: comm=nfsiod pages=192 time=28ms
[  403.751552] write_bandwidth: comm=nfsiod pages=192 time=36ms
[  403.785979] write_bandwidth: comm=nfsiod pages=192 time=28ms
[  403.823995] write_bandwidth: comm=nfsiod pages=192 time=36ms
[  403.858970] write_bandwidth: comm=nfsiod pages=192 time=32ms
[  403.880786] write_bandwidth: comm=nfsiod pages=192 time=16ms
[  403.902732] write_bandwidth: comm=nfsiod pages=192 time=20ms
[  403.925925] write_bandwidth: comm=nfsiod pages=192 time=20ms
[  403.952044] write_bandwidth: comm=nfsiod pages=258 time=24ms
[  403.974006] write_bandwidth: comm=nfsiod pages=192 time=16ms
[  403.995989] write_bandwidth: comm=nfsiod pages=192 time=20ms
[  405.031049] write_bandwidth: comm=nfsiod pages=192 time=1032ms
[  405.257635] write_bandwidth: comm=nfsiod pages=1536 time=192ms
[  405.279069] write_bandwidth: comm=nfsiod pages=192 time=20ms
[  405.300843] write_bandwidth: comm=nfsiod pages=192 time=16ms
[  405.326031] write_bandwidth: comm=nfsiod pages=192 time=20ms
[  405.350843] write_bandwidth: comm=nfsiod pages=192 time=24ms
[  405.375160] write_bandwidth: comm=nfsiod pages=192 time=20ms
[  409.331015] write_bandwidth: comm=nfsiod pages=192 time=3952ms
[  409.587928] write_bandwidth: comm=nfsiod pages=1536 time=152ms
[  409.610068] write_bandwidth: comm=nfsiod pages=192 time=20ms
[  409.635736] write_bandwidth: comm=nfsiod pages=192 time=24ms

# vmmon -d 1 nr_writeback nr_dirty nr_unstable

     nr_writeback         nr_dirty      nr_unstable
            11227            41463            38044
            11227            41463            38044
            11227            41463            38044
            11227            41463            38044
            11045            53987             6490
            11033            53120             8145
            11195            52143            10886
            11211            52144            10913
            11211            52144            10913
            11211            52144            10913
            11056            56887             3876
            11062            55298             8155
            11214            54485             9838
            11225            54461             9852
            11225            54461             9852
            11225            54461             4582
            22342            35535             7823

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw 
  0   0   9  92   0   0|   0     0 |  66B  306B|   0     0 |1003   377 
  0   1  39  60   0   1|   0     0 |  90k 1361k|   0     0 |1765  1599 
  0  15  12  43   0  31|   0     0 |2292k   34M|   0     0 |  12k   16k
  0   0  16  84   0   0|   0     0 | 132B  306B|   0     0 |1003   376 
  0   0  43  57   0   0|   0     0 |  66B  306B|   0     0 |1004   376 
  0   7  25  55   0  13|   0     0 |1202k   18M|   0     0 |7331  8921 
  0   8  21  55   0  15|   0     0 |1195k   18M|   0     0 |5382  6579 
  0   0  38  62   0   0|   0     0 |  66B  306B|   0     0 |1002   371 
  0   0  33  67   0   0|   0     0 |  66B  306B|   0     0 |1003   376 
  0  14  20  41   0  24|   0     0 |1621k   24M|   0     0 |8549    10k
  0   5  31  55   0   9|   0     0 | 769k   11M|   0     0 |4444  5180 
  0   0  18  82   0   0|   0     0 |  66B  568B|   0     0 |1004   377 
  0   1  41  54   0   3|   0     0 | 184k 2777k|   0     0 |2609  2619 
  1  13  22  43   0  22|   0     0 |1572k   23M|   0     0 |8138    10k
  0  11   9  59   0  20|   0     0 |1861k   27M|   0     0 |9576    13k
  0   5  23  66   0   5|   0     0 | 540k 8122k|   0     0 |2816  2885 


 fs/nfs/client.c           |    2 
 fs/nfs/write.c            |   92 +++++++++++++++++++++++++++++++-----
 include/linux/nfs_fs_sb.h |    1 
 3 files changed, 84 insertions(+), 11 deletions(-)

--- linux.orig/fs/nfs/write.c	2009-11-06 09:52:23.000000000 +0800
+++ linux/fs/nfs/write.c	2009-11-06 09:52:27.000000000 +0800
@@ -187,11 +187,65 @@ static int wb_priority(struct writeback_
  * NFS congestion control
  */
 
+#define NFS_WAIT_PAGES	(1024L >> (PAGE_CACHE_SHIFT - 10))
 int nfs_congestion_kb;
 
-#define NFS_CONGESTION_ON_THRESH 	(nfs_congestion_kb >> (PAGE_SHIFT-10))
-#define NFS_CONGESTION_OFF_THRESH	\
-	(NFS_CONGESTION_ON_THRESH - (NFS_CONGESTION_ON_THRESH >> 2))
+/*
+ * SYNC requests will block on (2*limit) and wakeup on (2*limit-NFS_WAIT_PAGES)
+ * ASYNC requests will block on (limit) and wakeup on (limit - NFS_WAIT_PAGES)
+ * In this way SYNC writes will never be blocked by ASYNC ones.
+ */
+
+static void nfs_set_congested(long nr, long limit,
+			      struct backing_dev_info *bdi)
+{
+	if (nr > limit && !test_bit(BDI_async_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_ASYNC);
+	else if (nr > 2 * limit && !test_bit(BDI_sync_congested, &bdi->state))
+		set_bdi_congested(bdi, BLK_RW_SYNC);
+}
+
+static void nfs_wait_contested(int is_sync,
+			       struct backing_dev_info *bdi,
+			       wait_queue_head_t *wqh)
+{
+	int waitbit = is_sync ? BDI_sync_congested : BDI_async_congested;
+ 	DEFINE_WAIT(wait);
+
+	if (!test_bit(waitbit, &bdi->state))
+		return;
+
+	for (;;) {
+		prepare_to_wait(&wqh[is_sync], &wait, TASK_UNINTERRUPTIBLE);
+		if (!test_bit(waitbit, &bdi->state))
+			break;
+
+		io_schedule();
+	}
+	finish_wait(&wqh[is_sync], &wait);
+}
+
+static void nfs_wakeup_congested(long nr, long limit,
+				 struct backing_dev_info *bdi,
+				 wait_queue_head_t *wqh)
+{
+	if (nr < 2*limit - min(limit/8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_sync_congested, &bdi->state)) {
+			clear_bdi_congested(bdi, BLK_RW_SYNC);
+			smp_mb__after_clear_bit();
+		}
+		if (waitqueue_active(&wqh[BLK_RW_SYNC]))
+			wake_up(&wqh[BLK_RW_SYNC]);
+	}
+	if (nr < limit - min(limit/8, NFS_WAIT_PAGES)) {
+		if (test_bit(BDI_async_congested, &bdi->state)) {
+			clear_bdi_congested(bdi, BLK_RW_ASYNC);
+			smp_mb__after_clear_bit();
+		}
+		if (waitqueue_active(&wqh[BLK_RW_ASYNC]))
+			wake_up(&wqh[BLK_RW_ASYNC]);
+	}
+}
 
 static int nfs_set_page_writeback(struct page *page)
 {
@@ -201,11 +255,9 @@ static int nfs_set_page_writeback(struct
 		struct inode *inode = page->mapping->host;
 		struct nfs_server *nfss = NFS_SERVER(inode);
 
-		if (atomic_long_inc_return(&nfss->writeback) >
-				NFS_CONGESTION_ON_THRESH) {
-			set_bdi_congested(&nfss->backing_dev_info,
-						BLK_RW_ASYNC);
-		}
+		nfs_set_congested(atomic_long_inc_return(&nfss->writeback),
+				  nfs_congestion_kb >> (PAGE_SHIFT-10),
+				  &nfss->backing_dev_info);
 	}
 	return ret;
 }
@@ -216,8 +268,11 @@ static void nfs_end_page_writeback(struc
 	struct nfs_server *nfss = NFS_SERVER(inode);
 
 	end_page_writeback(page);
-	if (atomic_long_dec_return(&nfss->writeback) < NFS_CONGESTION_OFF_THRESH)
-		clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
+
+	nfs_wakeup_congested(atomic_long_dec_return(&nfss->writeback),
+			     nfs_congestion_kb >> (PAGE_SHIFT-10),
+			     &nfss->backing_dev_info,
+			     nfss->writeback_wait);
 }
 
 static struct nfs_page *nfs_find_and_lock_request(struct page *page)
@@ -309,19 +364,34 @@ static int nfs_writepage_locked(struct p
 
 int nfs_writepage(struct page *page, struct writeback_control *wbc)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_writepage_locked(page, wbc);
 	unlock_page(page);
+
+	nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
 
-static int nfs_writepages_callback(struct page *page, struct writeback_control *wbc, void *data)
+static int nfs_writepages_callback(struct page *page,
+				   struct writeback_control *wbc, void *data)
 {
+	struct inode *inode = page->mapping->host;
+	struct nfs_server *nfss = NFS_SERVER(inode);
 	int ret;
 
 	ret = nfs_do_writepage(page, wbc, data);
 	unlock_page(page);
+
+	nfs_wait_contested(wbc->sync_mode == WB_SYNC_ALL,
+			   &nfss->backing_dev_info,
+			   nfss->writeback_wait);
+
 	return ret;
 }
 
--- linux.orig/include/linux/nfs_fs_sb.h	2009-11-06 09:22:30.000000000 +0800
+++ linux/include/linux/nfs_fs_sb.h	2009-11-06 09:52:27.000000000 +0800
@@ -108,6 +108,7 @@ struct nfs_server {
 	struct nfs_iostats *	io_stats;	/* I/O statistics */
 	struct backing_dev_info	backing_dev_info;
 	atomic_long_t		writeback;	/* number of writeback pages */
+	wait_queue_head_t	writeback_wait[2];
 	int			flags;		/* various flags */
 	unsigned int		caps;		/* server capabilities */
 	unsigned int		rsize;		/* read size */
--- linux.orig/fs/nfs/client.c	2009-11-06 09:22:30.000000000 +0800
+++ linux/fs/nfs/client.c	2009-11-06 09:52:27.000000000 +0800
@@ -991,6 +991,8 @@ static struct nfs_server *nfs_alloc_serv
 	INIT_LIST_HEAD(&server->master_link);
 
 	atomic_set(&server->active, 0);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_SYNC]);
+	init_waitqueue_head(&server->writeback_wait[BLK_RW_ASYNC]);
 
 	server->io_stats = nfs_alloc_iostats();
 	if (!server->io_stats) {


Thread overview: 8 messages
2009-11-09 15:15 Likely stupid question on "throttle_vm_writeout" Martin Knoblauch
2009-11-09 15:26 ` Peter Zijlstra
2009-11-10  2:08   ` Wu Fengguang
2009-11-10  6:55     ` KOSAKI Motohiro
2009-11-10 12:01     ` Martin Knoblauch
2009-11-10 13:08       ` Wu Fengguang
2009-11-10 16:11         ` Martin Knoblauch
2009-11-11  0:45           ` Wu Fengguang
