* balance_pgdat(): where is total_scanned ever updated?
@ 2004-11-06 19:15 Chuck Ebbert
  2004-11-07  0:11 ` Andrew Morton
From: Chuck Ebbert @ 2004-11-06 19:15 UTC
  To: linux-kernel; +Cc: Nick Piggin

Kernel version is 2.6.9, but I see no updates to this function in BK-current.
How is total_scanned ever updated?  AFAICT it is always zero.

In mm/vmscan.c:balance_pgdat(), there are these references to total_scanned
(missing whitespace indicated by "^"):


 977:        int total_scanned, total_reclaimed;

 983:        total_scanned = 0;

1076:                        if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
1077:                            total_scanned > total_reclaimed+total_reclaimed/2)
                                                               ^ ^             ^ ^

1088:                if (total_scanned && priority < DEF_PRIORITY - 2)


Could this be part of the problems with reclaim?  Or have I missed something?


--Chuck Ebbert  06-Nov-04  14:15:21


* Re: balance_pgdat(): where is total_scanned ever updated?
  2004-11-06 19:15 balance_pgdat(): where is total_scanned ever updated? Chuck Ebbert
@ 2004-11-07  0:11 ` Andrew Morton
  2004-11-09 10:42   ` Marcelo Tosatti
From: Andrew Morton @ 2004-11-07  0:11 UTC
  To: Chuck Ebbert; +Cc: linux-kernel, nickpiggin

Chuck Ebbert <76306.1226@compuserve.com> wrote:
>
> Kernel version is 2.6.9, but I see no updates to this function in BK-current.
> How is total_scanned ever updated?  AFAICT it is always zero.

It's a bug which was introduced months ago when we added struct
reclaim_state.

> In mm/vmscan.c:balance_pgdat(), there are these references to total_scanned
> (missing whitespace indicated by "^"):
> 
> 
>  977:        int total_scanned, total_reclaimed;
> 
>  983:        total_scanned = 0;
> 
> 1076:                        if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> 1077:                            total_scanned > total_reclaimed+total_reclaimed/2)
>                                                                ^ ^             ^ ^
> 
> 1088:                if (total_scanned && priority < DEF_PRIORITY - 2)
> 
> 
> Could this be part of the problems with reclaim?  Or have I missed something?

I had a patch which fixes it in -mm for a while.  It does increase the
number of pages which are reclaimed via direct reclaim and decreases the
number of pages which are reclaimed by kswapd.  As one would expect from
throttling kswapd.  This seems undesirable.

I'm leaving this alone until it can be demonstrated that fixing it improves
kernel behaviour in some manner.
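
For reference, the fix under discussion amounts to one line of accounting
in balance_pgdat()'s zone loop.  A sketch against the 2.6.9 source,
assuming sc.nr_scanned is meant to feed total_scanned the same way
sc.nr_reclaimed already feeds total_reclaimed (the exact -mm patch may
differ):

        shrink_zone(zone, &sc);
        total_reclaimed += sc.nr_reclaimed;
        /*
         * The missing accounting: without this line total_scanned
         * stays zero and both tests quoted above are dead code.
         */
        total_scanned += sc.nr_scanned;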



* Re: balance_pgdat(): where is total_scanned ever updated?
  2004-11-07  0:11 ` Andrew Morton
@ 2004-11-09 10:42   ` Marcelo Tosatti
  2004-11-09 19:36     ` Andrew Morton
From: Marcelo Tosatti @ 2004-11-09 10:42 UTC
  To: Andrew Morton; +Cc: Chuck Ebbert, linux-kernel, nickpiggin

On Sat, Nov 06, 2004 at 04:11:14PM -0800, Andrew Morton wrote:
> Chuck Ebbert <76306.1226@compuserve.com> wrote:
> >
> > Kernel version is 2.6.9, but I see no updates to this function in BK-current.
> > How is total_scanned ever updated?  AFAICT it is always zero.
> 
> It's a bug which was introduced months ago when we added struct
> reclaim_state.
> 
> > In mm/vmscan.c:balance_pgdat(), there are these references to total_scanned
> > (missing whitespace indicated by "^"):
> > 
> > 
> >  977:        int total_scanned, total_reclaimed;
> > 
> >  983:        total_scanned = 0;
> > 
> > 1076:                        if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > 1077:                            total_scanned > total_reclaimed+total_reclaimed/2)
> >                                                                ^ ^             ^ ^
> > 
> > 1088:                if (total_scanned && priority < DEF_PRIORITY - 2)
> > 
> > 
> > Could this be part of the problems with reclaim?  Or have I missed something?
> 
> I had a patch which fixes it in -mm for a while.  It does increase the
> number of pages which are reclaimed via direct reclaim and decreases the
> number of pages which are reclaimed by kswapd.  As one would expect from
> throttling kswapd.  This seems undesirable.

Hi Andrew,

Do you have any numbers to back up the claim "It does increase the
number of pages which are reclaimed via direct reclaim and decreases the
number of pages which are reclaimed by kswapd", please?

Because linux-2.6.10-rc1-mm2 (and 2.6.9) completely ignores sc->may_writepage
under normal operation; it's only used when laptop_mode is on:

		if (laptop_mode && !sc->may_writepage)
			goto keep_locked;

Is this intentional ???

> I'm leaving this alone until it can be demonstrated that fixing it improves
> kernel behaviour in some manner.

I don't see it working at all?

I'll see if I find time to do some tests... (my usual disclaimer).


* Re: balance_pgdat(): where is total_scanned ever updated?
  2004-11-09 19:36     ` Andrew Morton
@ 2004-11-09 18:02       ` Marcelo Tosatti
  2004-11-09 21:40         ` Andrew Morton
From: Marcelo Tosatti @ 2004-11-09 18:02 UTC
  To: Andrew Morton; +Cc: 76306.1226, linux-kernel, nickpiggin

On Tue, Nov 09, 2004 at 11:36:20AM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > On Sat, Nov 06, 2004 at 04:11:14PM -0800, Andrew Morton wrote:
> > > Chuck Ebbert <76306.1226@compuserve.com> wrote:
> > > >
> > > > Kernel version is 2.6.9, but I see no updates to this function in BK-current.
> > > > How is total_scanned ever updated?  AFAICT it is always zero.
> > > 
> > > It's a bug which was introduced months ago when we added struct
> > > reclaim_state.
> > > 
> > > > In mm/vmscan.c:balance_pgdat(), there are these references to total_scanned
> > > > (missing whitespace indicated by "^"):
> > > > 
> > > > 
> > > >  977:        int total_scanned, total_reclaimed;
> > > > 
> > > >  983:        total_scanned = 0;
> > > > 
> > > > 1076:                        if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > > > 1077:                            total_scanned > total_reclaimed+total_reclaimed/2)
> > > >                                                                ^ ^             ^ ^
> > > > 
> > > > 1088:                if (total_scanned && priority < DEF_PRIORITY - 2)
> > > > 
> > > > 
> > > > Could this be part of the problems with reclaim?  Or have I missed something?
> > > 
> > > I had a patch which fixes it in -mm for a while.  It does increase the
> > > number of pages which are reclaimed via direct reclaim and decreases the
> > > number of pages which are reclaimed by kswapd.  As one would expect from
> > > throttling kswapd.  This seems undesirable.
> > 
> > Hi Andrew,
> > 
> > Do you have any numbers to back up the claim "It does increase the
> > number of pages which are reclaimed via direct reclaim and decreases the
> > number of pages which are reclaimed by kswapd", please?
> 
> Run a workload and watch /proc/vmstat.  iirc, the one-line total_scanned
> fix takes the kswapd-vs-direct reclaim rate from 1:1 to 1:3 or thereabouts.

You're talking about laptop_mode ONLY, then?
How can that have any effect if may_writepage is ignored if !laptop_mode? 

About /proc/vmstat - each output is huge - do you actually read those?

We need a vmstat-like tool for that information to be readable.

> > Because linux-2.6.10-rc1-mm2 (and 2.6.9) completely ignores sc->may_writepage
> > under normal operation; it's only used when laptop_mode is on:
> > 
> > 		if (laptop_mode && !sc->may_writepage)
> > 			goto keep_locked;
> > 
> > Is this intentional ???
> 
> yup.  In laptop mode we try to scan further to find a clean page rather
> than spinning up the disk for a writepage.

It might be interesting to use sc->may_writepage independently of
laptop mode (i.e. make kswapd only write out pages if the reclaim ratio
is low).

> > > I'm leaving this alone until it can be demonstrated that fixing it improves
> > > kernel behaviour in some manner.
> > 
> > I don't see it working at all?
> > 
> 
> There's lots of useful info in /proc/vmstat.

I don't care much about laptop mode.



* Re: balance_pgdat(): where is total_scanned ever updated?
  2004-11-09 21:40         ` Andrew Morton
@ 2004-11-09 18:52           ` Marcelo Tosatti
  2004-11-09 22:40             ` Andrew Morton
  2004-11-10 13:24             ` Nikita Danilov
From: Marcelo Tosatti @ 2004-11-09 18:52 UTC
  To: Andrew Morton; +Cc: 76306.1226, linux-kernel, nickpiggin

On Tue, Nov 09, 2004 at 01:40:32PM -0800, Andrew Morton wrote:
> > > > > I had a patch which fixes it in -mm for a while.  It does increase the
> > > > > number of pages which are reclaimed via direct reclaim and decreases the
> > > > > number of pages which are reclaimed by kswapd.  As one would expect from
> > > > > throttling kswapd.  This seems undesirable.
> > > > 
> > > > Hi Andrew,
> > > > 
> > > > Do you have any numbers to back up the claim "It does increase the
> > > > number of pages which are reclaimed via direct reclaim and decreases the
> > > > number of pages which are reclaimed by kswapd", please?
> > > 
> > > Run a workload and watch /proc/vmstat.  iirc, the one-line total_scanned
> > > fix takes the kswapd-vs-direct reclaim rate from 1:1 to 1:3 or thereabouts.
> > 
> > You're talking about laptop_mode ONLY, then?
> 
> No, not at all.
> 
> If we restore the total_scanned logic then kswapd will throttle itself, as
> designed.  Regardless of laptop_mode.  I did that, and monitored the page
> scanning and reclaim rates under various workloads.  I observed that with
> the fix in place, kswapd performed less page reclaim and direct-reclaim
> performed more reclaim.  And I wasn't able to demonstrate any benchmark
> improvements with the fix in place, so things are left as they are.

Ah, OK, I understand what you mean. I was thinking about sc->may_writepage 
only and its effects on shrink_list/pageout.

You remind me about the self-throttling (blk_congestion_wait).
It makes sense now.

Andrea noted that blk_congestion_wait waits on IO which is not generated by 
reclaim - which is indeed a bad thing - it should only wait on IO which the
VM itself has started.

> > How can that have any effect if may_writepage is ignored if !laptop_mode? 
> 
> This is to do with kswapd throttling.  If we put kswapd to sleep more
> often, it does less scanning and reclaiming.

OK! 

> > About /proc/vmstat - each output is huge - do you actually read those?
> 
> yup.
> 
> 	cat /proc/vmstat > /tmp/1
> 	run workload
> 	cat /proc/vmstat > /tmp/2
> 	analyse /tmp/1 and /tmp/2

Will do that more often. :) 

> > We need a vmstat-like tool for that information to be readable.
> 
> Would be nice.

I've been thinking of doing a Python-based tool someday.
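
In the meantime, a delta tool is only a screenful of C.  Here is a minimal
sketch (hypothetical, not an existing utility) that diffs two saved
snapshots, relying on /proc/vmstat being one "name value" pair per line:

#include <stdio.h>
#include <string.h>

#define MAXFIELDS 128

struct field {
        char name[64];
        unsigned long val;
};

static int read_snapshot(const char *path, struct field *f)
{
        FILE *fp = fopen(path, "r");
        int n = 0;

        if (!fp)
                return -1;
        while (n < MAXFIELDS &&
               fscanf(fp, "%63s %lu", f[n].name, &f[n].val) == 2)
                n++;
        fclose(fp);
        return n;
}

int main(int argc, char **argv)
{
        struct field a[MAXFIELDS], b[MAXFIELDS];
        int na, nb, i, j;

        if (argc != 3) {
                fprintf(stderr, "usage: %s before after\n", argv[0]);
                return 1;
        }
        na = read_snapshot(argv[1], a);
        nb = read_snapshot(argv[2], b);
        if (na < 0 || nb < 0)
                return 1;
        /* print every counter that moved between the two snapshots */
        for (i = 0; i < na; i++)
                for (j = 0; j < nb; j++)
                        if (!strcmp(a[i].name, b[j].name) &&
                            a[i].val != b[j].val)
                                printf("%-24s %+ld\n", a[i].name,
                                       (long)(b[j].val - a[i].val));
        return 0;
}

Run as "vmstat-delta /tmp/1 /tmp/2" against the two snapshots from the
workflow quoted above.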

> > > > Because linux-2.6.10-rc1-mm2 (and 2.6.9) completely ignores sc->may_writepage
> > > > under normal operation; it's only used when laptop_mode is on:
> > > > 
> > > > 		if (laptop_mode && !sc->may_writepage)
> > > > 			goto keep_locked;
> > > > 
> > > > Is this intentional ???
> > > 
> > > yup.  In laptop mode we try to scan further to find a clean page rather
> > > than spinning up the disk for a writepage.
> > 
> > It might be interesting to use sc->may_writepage independently of
> > laptop mode (i.e. make kswapd only write out pages if the reclaim ratio
> > is low).
> 
> sure.

Another related thing I noted this afternoon is that right now kswapd will
always block on full queues:

static int may_write_to_queue(struct backing_dev_info *bdi)
{
        if (current_is_kswapd())
                return 1;
        if (current_is_pdflush())       /* This is unlikely, but why not... */
                return 1;
        if (!bdi_write_congested(bdi))
                return 1;
        if (bdi == current->backing_dev_info)
                return 1;
        return 0;
}

We should make kswapd use the "bdi_write_congested" information and avoid
blocking on full queues. It should improve performance on multi-device 
systems with intense VM loads.

Maybe something along the lines of:

"if the reclaim ratio is high, do not writepage"
"if the reclaim ratio is below high, writepage but do not block"
"if the reclaim ratio is low, writepage and block"


* Re: balance_pgdat(): where is total_scanned ever updated?
  2004-11-09 10:42   ` Marcelo Tosatti
@ 2004-11-09 19:36     ` Andrew Morton
  2004-11-09 18:02       ` Marcelo Tosatti
From: Andrew Morton @ 2004-11-09 19:36 UTC
  To: Marcelo Tosatti; +Cc: 76306.1226, linux-kernel, nickpiggin

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> On Sat, Nov 06, 2004 at 04:11:14PM -0800, Andrew Morton wrote:
> > Chuck Ebbert <76306.1226@compuserve.com> wrote:
> > >
> > > Kernel version is 2.6.9, but I see no updates to this function in BK-current.
> > > How is total_scanned ever updated?  AFAICT it is always zero.
> > 
> > It's a bug which was introduced months ago when we added struct
> > reclaim_state.
> > 
> > > In mm/vmscan.c:balance_pgdat(), there are these references to total_scanned
> > > (missing whitespace indicated by "^"):
> > > 
> > > 
> > >  977:        int total_scanned, total_reclaimed;
> > > 
> > >  983:        total_scanned = 0;
> > > 
> > > 1076:                        if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
> > > 1077:                            total_scanned > total_reclaimed+total_reclaimed/2)
> > >                                                                ^ ^             ^ ^
> > > 
> > > 1088:                if (total_scanned && priority < DEF_PRIORITY - 2)
> > > 
> > > 
> > > Could this be part of the problems with reclaim?  Or have I missed something?
> > 
> > I had a patch which fixes it in -mm for a while.  It does increase the
> > number of pages which are reclaimed via direct reclaim and decreases the
> > number of pages which are reclaimed by kswapd.  As one would expect from
> > throttling kswapd.  This seems undesirable.
> 
> Hi Andrew,
> 
> Do you have any numbers to back up the claim "It does increase the
> number of pages which are reclaimed via direct reclaim and decreases the
> number of pages which are reclaimed by kswapd", please?

Run a workload and watch /proc/vmstat.  iirc, the one-line total_scanned
fix takes the kswapd-vs-direct reclaim rate from 1:1 to 1:3 or thereabouts.


> Because linux-2.6.10-rc1-mm2 (and 2.6.9) completely ignores sc->may_writepage
> under normal operation; it's only used when laptop_mode is on:
> 
> 		if (laptop_mode && !sc->may_writepage)
> 			goto keep_locked;
> 
> Is this intentional ???

yup.  In laptop mode we try to scan further to find a clean page rather
than spinning up the disk for a writepage.

> > I'm leaving this alone until it can be demonstrated that fixing it improves
> > kernel behaviour in some manner.
> 
> I don't see it working at all?
> 

There's lots of useful info in /proc/vmstat.


* Re: balance_pgdat(): where is total_scanned ever updated?
  2004-11-09 18:02       ` Marcelo Tosatti
@ 2004-11-09 21:40         ` Andrew Morton
  2004-11-09 18:52           ` Marcelo Tosatti
From: Andrew Morton @ 2004-11-09 21:40 UTC
  To: Marcelo Tosatti; +Cc: 76306.1226, linux-kernel, nickpiggin

> > > > I had a patch which fixes it in -mm for a while.  It does increase the
> > > > number of pages which are reclaimed via direct reclaim and decreases the
> > > > number of pages which are reclaimed by kswapd.  As one would expect from
> > > > throttling kswapd.  This seems undesirable.
> > > 
> > > Hi Andrew,
> > > 
> > > Do you have any numbers to back up the claim "It does increase the
> > > number of pages which are reclaimed via direct reclaim and decreases the
> > > number of pages which are reclaimed by kswapd", please?
> > 
> > Run a workload and watch /proc/vmstat.  iirc, the one-line total_scanned
> > fix takes the kswapd-vs-direct reclaim rate from 1:1 to 1:3 or thereabouts.
> 
> You're talking about laptop_mode ONLY, then?

No, not at all.

If we restore the total_scanned logic then kswapd will throttle itself, as
designed.  Regardless of laptop_mode.  I did that, and monitored the page
scanning and reclaim rates under various workloads.  I observed that with
the fix in place, kswapd performed less page reclaim and direct-reclaim
performed more reclaim.  And I wasn't able to demonstrate any benchmark
improvements with the fix in place, so things are left as they are.

> How can that have any effect if may_writepage is ignored if !laptop_mode? 

This is to do with kswapd throttling.  If we put kswapd to sleep more
often, it does less scanning and reclaiming.

> About /proc/vmstat - each output is huge - do you actually read those?

yup.

	cat /proc/vmstat > /tmp/1
	run workload
	cat /proc/vmstat > /tmp/2
	analyse /tmp/1 and /tmp/2

> We need a vmstat-like tool for that information to be readable.

Would be nice.

> > > Because linux-2.6.10-rc1-mm2 (and 2.6.9) completely ignores sc->may_writepage
> > > under normal operation; it's only used when laptop_mode is on:
> > > 
> > > 		if (laptop_mode && !sc->may_writepage)
> > > 			goto keep_locked;
> > > 
> > > Is this intentional ???
> > 
> > yup.  In laptop mode we try to scan further to find a clean page rather
> > than spinning up the disk for a writepage.
> 
> It might be interesting to use sc->may_writepage independently of
> laptop mode (i.e. make kswapd only write out pages if the reclaim ratio
> is low).

sure.



* Re: balance_pgdat(): where is total_scanned ever updated?
  2004-11-09 18:52           ` Marcelo Tosatti
@ 2004-11-09 22:40             ` Andrew Morton
  2004-11-10 13:24             ` Nikita Danilov
From: Andrew Morton @ 2004-11-09 22:40 UTC
  To: Marcelo Tosatti; +Cc: 76306.1226, linux-kernel, nickpiggin

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> ...
> > > You're talking about laptop_mode ONLY, then?
> > 
> > No, not at all.
> > 
> > If we restore the total_scanned logic then kswapd will throttle itself, as
> > designed.  Regardless of laptop_mode.  I did that, and monitored the page
> > scanning and reclaim rates under various workloads.  I observed that with
> > the fix in place, kswapd performed less page reclaim and direct-reclaim
> > performed more reclaim.  And I wasn't able to demonstrate any benchmark
> > improvements with the fix in place, so things are left as they are.
> 
> Ah, OK, I understand what you mean. I was thinking about sc->may_writepage 
> only and its effects on shrink_list/pageout.
> 
> You remind me about the self-throttling (blk_congestion_wait).
> It makes sense now.
> 
> Andrea noted that blk_congestion_wait waits on IO which is not generated by 
> reclaim - which is indeed a bad thing - it should only wait on IO which the
> VM itself has started.

Yes, blk_congestion_wait() is very approximate.  It was always intended
that it be replaced by wakeups from end_page_writeback(), directed to
waitqueues which correspond to the classzones to which the page belongs. 
That way, page reclaiming processes can throttle precisely upon I/O
completion against pages which are useful to them.

But I've never seen a report of a problem which would be solved by such a
change, and so the cost of delivering multiple wakeups at
end_page_writeback() doesn't seem justified thus far.
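
For the record, a rough sketch of what that precise throttling could look
like (zone->writeback_wait and both helpers are hypothetical names;
nothing like this exists in the tree):

/* called from end_page_writeback(): wake reclaimers waiting on this zone */
static void zone_writeback_done(struct page *page)
{
        struct zone *zone = page_zone(page);

        if (waitqueue_active(&zone->writeback_wait))
                wake_up(&zone->writeback_wait);
}

/*
 * Reclaimers would sleep here instead of in blk_congestion_wait(), so
 * they are woken only by I/O completion against pages in a zone that
 * is actually useful to them.
 */
static void zone_writeback_wait(struct zone *zone, long timeout)
{
        DEFINE_WAIT(wait);

        prepare_to_wait(&zone->writeback_wait, &wait, TASK_UNINTERRUPTIBLE);
        schedule_timeout(timeout);
        finish_wait(&zone->writeback_wait, &wait);
}

The cost mentioned above is that a highmem page belongs to several
classzones, so end_page_writeback() would have to deliver a wakeup for
each of them.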

> ...
> Another related thing I noted this afternoon is that right now kswapd will
> always block on full queues:
> 
> static int may_write_to_queue(struct backing_dev_info *bdi)
> {
>         if (current_is_kswapd())
>                 return 1;
>         if (current_is_pdflush())       /* This is unlikely, but why not... */
>                 return 1;
>         if (!bdi_write_congested(bdi))
>                 return 1;
>         if (bdi == current->backing_dev_info)
>                 return 1;
>         return 0;
> }
> 
> We should make kswapd use the "bdi_write_congested" information and avoid
> blocking on full queues. It should improve performance on multi-device 
> systems with intense VM loads.
> 
> Maybe something along the lines of:
> 
> "if the reclaim ratio is high, do not writepage"
> "if the reclaim ratio is below high, writepage but do not block"
> "if the reclaim ratio is low, writepage and block"

It used to do that, but it was taken out.  gack, brain-strain.  umm, dig,
dig.   Here you go:

The `low latency page reclaim' design works by preventing page
allocators from blocking on request queues (and by preventing them from
blocking against writeback of individual pages, but that is immaterial
here).

This has a problem under some situations.  pdflush (or a write(2)
caller) could be saturating the queue with highmem pages.  This
prevents anyone from writing back ZONE_NORMAL pages.  We end up doing
enormous amounts of scanning.

A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
then kill the mmapping applications.  The machine instantly goes from
0% of memory dirty to 95% or more.  pdflush kicks in and starts writing
the least-recently-dirtied pages, which are all highmem.  The queue is
congested so nobody will write back ZONE_NORMAL pages.  kswapd chews
50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
efficiency (pages_reclaimed/pages_scanned) falls to 2%.

So this patch changes the policy for kswapd.  kswapd may use all of a
request queue, and is prepared to block on request queues.

What will now happen in the above scenario is:

1: The page allocator scans some pages, fails to reclaim enough
   memory and takes a nap in blk_congestion_wait().

2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
   back pages.  (These pages will be rotated to the tail of the
   inactive list at IO-completion interrupt time).

   This writeback will saturate the queue with ZONE_NORMAL pages. 
   Conveniently, pdflush will avoid the congested queues.  So we end up
   writing the correct pages.

In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
efficiency rises from 2% to 40% and things are generally a lot happier.


The downside is that kswapd may now do a lot less page reclaim,
increasing page allocation latency, causing more direct reclaim,
increasing lock contention in the VM, etc.  But I have not been able to
demonstrate that in testing.


The other problem is that there is only one kswapd, and there are lots
of disks.  That is a generic problem - without being able to co-opt
user processes we don't have enough threads to keep lots of disks saturated.

One fix for this would be to add an additional "really congested"
threshold in the request queues, so kswapd can still perform
nonblocking writeout.  This gives kswapd priority over pdflush while
allowing kswapd to feed many disk queues.  I doubt if this will be
called for.


 include/linux/swap.h |    6 ++++++
 mm/vmscan.c          |   21 +++++++++++++++------
 2 files changed, 21 insertions(+), 6 deletions(-)

--- 25/mm/vmscan.c~blocking-kswapd	Sat Dec 21 16:24:37 2002
+++ 25-akpm/mm/vmscan.c	Sat Dec 21 16:24:37 2002
@@ -204,6 +204,19 @@ static inline int is_page_cache_freeable
 	return page_count(page) - !!PagePrivate(page) == 2;
 }
 
+static int may_write_to_queue(struct backing_dev_info *bdi)
+{
+	if (current_is_kswapd())
+		return 1;
+	if (current_is_pdflush())	/* This is unlikely, but why not... */
+		return 1;
+	if (!bdi_write_congested(bdi))
+		return 1;
+	if (bdi == current->backing_dev_info)
+		return 1;
+	return 0;
+}
+
 /*
  * shrink_list returns the number of reclaimed pages
  */
@@ -303,8 +316,6 @@ shrink_list(struct list_head *page_list,
 		 * See swapfile.c:page_queue_congested().
 		 */
 		if (PageDirty(page)) {
-			struct backing_dev_info *bdi;
-
 			if (!is_page_cache_freeable(page))
 				goto keep_locked;
 			if (!mapping)
@@ -313,9 +324,7 @@ shrink_list(struct list_head *page_list,
 				goto activate_locked;
 			if (!may_enter_fs)
 				goto keep_locked;
-			bdi = mapping->backing_dev_info;
-			if (bdi != current->backing_dev_info &&
-					bdi_write_congested(bdi))
+			if (!may_write_to_queue(mapping->backing_dev_info))
 				goto keep_locked;
 			write_lock(&mapping->page_lock);
 			if (test_clear_page_dirty(page)) {
@@ -424,7 +433,7 @@ keep:
 	if (pagevec_count(&freed_pvec))
 		__pagevec_release_nonlru(&freed_pvec);
 	mod_page_state(pgsteal, ret);
-	if (current->flags & PF_KSWAPD)
+	if (current_is_kswapd())
 		mod_page_state(kswapd_steal, ret);
 	mod_page_state(pgactivate, pgactivate);
 	return ret;
--- 25/include/linux/swap.h~blocking-kswapd	Sat Dec 21 16:24:37 2002
+++ 25-akpm/include/linux/swap.h	Sat Dec 21 16:24:37 2002
@@ -7,6 +7,7 @@
 #include <linux/linkage.h>
 #include <linux/mmzone.h>
 #include <linux/list.h>
+#include <linux/sched.h>
 #include <asm/atomic.h>
 #include <asm/page.h>
 
@@ -14,6 +15,11 @@
 #define SWAP_FLAG_PRIO_MASK	0x7fff
 #define SWAP_FLAG_PRIO_SHIFT	0
 
+static inline int current_is_kswapd(void)
+{
+	return current->flags & PF_KSWAPD;
+}
+
 /*
  * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
  * be swapped to.  The swap type and the offset into that swap type are

_
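
The "really congested" threshold from the last paragraph of that changelog
could be sketched as a second watermark on the queue (BDI_really_congested
is an invented flag; 2.6 only tracks ordinary read/write congestion in
bdi->state):

/*
 * Hypothetical: the block layer would set BDI_really_congested once a
 * queue is nearly full, well past the ordinary congestion mark.  kswapd
 * keeps feeding queues past the normal threshold, so it can service
 * many disks, but backs off before it would have to block.
 */
static int kswapd_may_write_to_queue(struct backing_dev_info *bdi)
{
        return !test_bit(BDI_really_congested, &bdi->state);
}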



* Re: balance_pgdat(): where is total_scanned ever updated?
  2004-11-09 18:52           ` Marcelo Tosatti
  2004-11-09 22:40             ` Andrew Morton
@ 2004-11-10 13:24             ` Nikita Danilov
  2004-11-11 14:49               ` Marcelo Tosatti
From: Nikita Danilov @ 2004-11-10 13:24 UTC
  To: Marcelo Tosatti; +Cc: Andrew Morton, 76306.1226, linux-kernel, nickpiggin

Marcelo Tosatti writes:

[...]

 > 
 > Another related thing I noted this afternoon is that right now kswapd will
 > always block on full queues:
 > 
 > static int may_write_to_queue(struct backing_dev_info *bdi)
 > {
 >         if (current_is_kswapd())
 >                 return 1;
 >         if (current_is_pdflush())       /* This is unlikely, but why not... */
 >                 return 1;
 >         if (!bdi_write_congested(bdi))
 >                 return 1;
 >         if (bdi == current->backing_dev_info)
 >                 return 1;
 >         return 0;
 > }
 > 
 > We should make kswapd use the "bdi_write_congested" information and avoid
 > blocking on full queues. It should improve performance on multi-device 
 > systems with intense VM loads.

This will have the following undesirable side effect: if
may_write_to_queue() returns false, the page is not paged out; instead
it is thrown to the head of the inactive queue, thus destroying the
"LRU ordering", and shrink_list() will dive deeper into the inactive
list, reclaiming hotter pages.

It's OK to accidentally skip pageout in the direct reclaim path, because

 - we hope most pageout is done by kswapd, and

 - we don't want __alloc_pages() to stall

but _something_ in the kernel should take the pain of actually writing
pages out in LRU order.

 > 
 > Maybe something along the lines 
 > 
 > "if the reclaim ratio is high, do not writepage"
 > "if the reclaim ratio is below high, writepage but not block"
 > "if the reclaim ratio is low, writepage and block"

If kswapd blocking is a concern, inactive list scanning should be
decoupled from actual page-out (a la Solaris): kswapd queues pages to
yet another kernel thread that calls pageout().

I played with this idea (see
http://nikita.w3.to/code/patches/2-6-10-rc1/async-writepage.txt; note
that async_writepage() has to be adjusted to work for kswapd), but while
in some cases (large concurrent builds) it does provide a benefit, in
other cases (heavy write through mmap) it makes throughput slightly
worse.

Besides, this doesn't completely avoid the problem of destroying LRU
ordering, as kswapd still proceeds further through the inactive list while
pages are sent out asynchronously.

Nikita.



* Re: balance_pgdat(): where is total_scanned ever updated?
  2004-11-10 13:24             ` Nikita Danilov
@ 2004-11-11 14:49               ` Marcelo Tosatti
  2004-11-11 19:37                 ` Nikita Danilov
From: Marcelo Tosatti @ 2004-11-11 14:49 UTC
  To: Nikita Danilov; +Cc: linux-mm

Switching to linux-mm.

On Wed, Nov 10, 2004 at 04:24:45PM +0300, Nikita Danilov wrote:
> Marcelo Tosatti writes:
> 
> [...]
> 
>  > 
>  > Another related thing I noted this afternoon is that right now kswapd will
>  > always block on full queues:
>  > 
>  > static int may_write_to_queue(struct backing_dev_info *bdi)
>  > {
>  >         if (current_is_kswapd())
>  >                 return 1;
>  >         if (current_is_pdflush())       /* This is unlikely, but why not... */
>  >                 return 1;
>  >         if (!bdi_write_congested(bdi))
>  >                 return 1;
>  >         if (bdi == current->backing_dev_info)
>  >                 return 1;
>  >         return 0;
>  > }
>  > 
>  > We should make kswapd use the "bdi_write_congested" information and avoid
>  > blocking on full queues. It should improve performance on multi-device 
>  > systems with intense VM loads.
> 
> This will have the following undesirable side effect: if
> may_write_to_queue() returns false, the page is not paged out; instead
> it is thrown to the head of the inactive queue, thus destroying the
> "LRU ordering", and shrink_list() will dive deeper into the inactive
> list, reclaiming hotter pages.
> It's OK to accidentally skip pageout in the direct reclaim path, because
> 
>  - we hope most pageout is done by kswapd, and
> 
>  - we don't want __alloc_pages() to stall
> 
> but _something_ in the kernel should take the pain of actually writing
> pages out in LRU order.

I see - it breaks the LRU ordering of pageout.

>  > Maybe something along the lines of:
>  > 
>  > "if the reclaim ratio is high, do not writepage"
>  > "if the reclaim ratio is below high, writepage but do not block"
>  > "if the reclaim ratio is low, writepage and block"
> 
> If kswapd blocking is a concern, inactive list scanning should be
> decoupled from actual page-out (a la Solaris): kswapd queues pages to
> yet another kernel thread that calls pageout().

It's just a concern; no numbers to back that up.

But it's pretty obvious that its behaviour is suboptimal when you
think about multi-device systems. kswapd may block, for example,
in get_block() (there is a comment on top of pageout() about
that), which makes the situation even worse.

> I played with this idea (see
> http://nikita.w3.to/code/patches/2-6-10-rc1/async-writepage.txt; note
> that async_writepage() has to be adjusted to work for kswapd), but while
> in some cases (large concurrent builds) it does provide a benefit, in
> other cases (heavy write through mmap) it makes throughput slightly
> worse.

Very sweet, I like it.

Why do you think the heavy write through mmap decreased throughput?

Would be nice if you had those numbers saved somewhere.

> Besides, this doesn't completely avoid the problem of destroying LRU
> ordering, as kswapd still proceeds further through the inactive list while
> pages are sent out asynchronously.

Well, pages are being sent out in order - which should do fine, no?

kswapd proceeds further through the inactive list while pages are sent
out asynchronously with the current design - pageout() writes, moves
the pages (now under IO) to the head of the inactive list, and
continues.



* Re: balance_pgdat(): where is total_scanned ever updated?
  2004-11-11 14:49               ` Marcelo Tosatti
@ 2004-11-11 19:37                 ` Nikita Danilov
From: Nikita Danilov @ 2004-11-11 19:37 UTC
  To: Marcelo Tosatti; +Cc: linux-mm

Marcelo Tosatti writes:
 > 

[...]

 > 
 > > I played with this idea (see
 > > http://nikita.w3.to/code/patches/2-6-10-rc1/async-writepage.txt; note
 > > that async_writepage() has to be adjusted to work for kswapd), but while
 > > in some cases (large concurrent builds) it does provide a benefit, in
 > > other cases (heavy write through mmap) it makes throughput slightly
 > > worse.
 > 
 > Very sweet, I like it.

An additional advantage of async-writepage is that in this case one has
a whole queue of dirty pages ready for page-out, so that some smarter
clustering can be implemented.

 > 
 > Why do you think the heavy write through mmap decreased throughput?

Because I thought I measured it, but see below :)

 > 
 > Would be nice if you had those numbers saved somewhere.

The second column is the averaged number of microseconds it takes to
dirty 1G through mmap (a big file larger than RAM is mmapped in 1G
chunks and one byte in each of its pages is touched in a loop).  Rows
correspond to patches from http://nikita.w3.to/code/patches/2-6-10-rc1/
applied one after another.

2.6.10-rc1                      77370854.641026
skip-writepage                  72766988.375000
dont-rotate-active-list         71440066.068966
async-writepage                 75028707.083333 /* regression */
batch-mark_page_accessed        74183312.078947
page_referenced-move-dirty      72947326.125000
dont-unmap-on-pageout           72702028.843750
ignore-page_referenced          74188417.156250 /* regression */
cluster-pageout                 69449001.583333

Err... now that I pasted this, I recall that the async-writepage patch
tested above does _not_ allow kswapd to do asynchronous page-out:

----------------------------------------------------------------------
/*
 * check whether writepage should be done asynchronously by kaiod.
 */
static int
async_writepage(struct page *page, int nr_dirty)
{
	/* goal of doing writepage asynchronously is to decrease latency of
	 * memory allocations involving direct reclaim, which is inapplicable
	 * to the kswapd */
	if (current_is_kswapd())
		return 0;
	/* limit number of pending async-writepage requests */
	else if (kaio_nr_requests > KAIO_THROTTLE)
		return 0;
	/* if we are under memory pressure---do pageout synchronously to
	 * throttle scanner. */
	else if (page_zone(page)->prev_priority != DEF_PRIORITY)
		return 0;
	/* if expected number of writepage requests submitted by this
	 * invocation of shrink_list() is small enough---do them
	 * asynchronously */
	else if (nr_dirty <= KAIO_CLUSTER_SIZE)
		return 1;
	else
		return 0;
}
----------------------------------------------------------------------

The first "if ... return 0;" should be removed.

Nikita.

* Re: balance_pgdat(): where is total_scanned ever updated?
@ 2004-11-10  3:34 Chuck Ebbert
From: Chuck Ebbert @ 2004-11-10  3:34 UTC
  To: Andrew Morton; +Cc: nickpiggin, linux-kernel

Andrew Morton <akpm@osdl.org> wrote:

> There's lots of useful info in /proc/vmstat.

 And the documentation on these fields is the source code itself, right? :)

 The nr_dirty field seems kind of useless -- why not have nr_dirtied
and nr_cleaned instead?  Analysis tools can subtract them to get nr_dirty.
Or is there some other field that shows the number of pages being dirtied?
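
 Something like this would do it (nr_dirtied and nr_cleaned are
hypothetical page_state fields here; 2.6.9 only keeps the instantaneous
nr_dirty):

        /* when dirtying, e.g. in set_page_dirty(): */
        if (!TestSetPageDirty(page))
                inc_page_state(nr_dirtied);     /* monotonic counter */

        /* when cleaning, e.g. where test_clear_page_dirty() is called: */
        if (TestClearPageDirty(page))
                inc_page_state(nr_cleaned);     /* likewise monotonic */

 An analysis tool then recovers the instantaneous figure as
nr_dirtied - nr_cleaned, and gets the dirtying rate for free.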


--Chuck Ebbert  09-Nov-04  22:13:44


* Re: balance_pgdat(): where is total_scanned ever updated?
@ 2004-11-07  5:02 Chuck Ebbert
From: Chuck Ebbert @ 2004-11-07  5:02 UTC
  To: Andrew Morton; +Cc: linux-kernel

Andrew Morton wrote:

> I'm leaving this alone until it can be demonstrated that fixing it improves
> kernel behaviour in some manner.

How about applying this patch so nobody else will be confused?


diff -ur bk-current/mm/vmscan.c edited/mm/vmscan.c
--- bk-current/mm/vmscan.c      2004-11-06 23:02:48.691160680 -0500
+++ edited/mm/vmscan.c  2004-11-06 23:13:02.636826752 -0500
@@ -1071,10 +1071,13 @@
                        /*
                         * If we've done a decent amount of scanning and
                         * the reclaim ratio is low, start doing writepage
-                        * even in laptop mode
+                        * even in laptop mode.
+                        * NOTE: total_scanned is always zero; this code
+                        *       does nothing.  Reactivating it has not been
+                        *       shown to be helpful at the moment.
                         */
                        if (total_scanned > SWAP_CLUSTER_MAX * 2 &&
-                           total_scanned > total_reclaimed+total_reclaimed/2)
+                           total_scanned > total_reclaimed + total_reclaimed / 2)
                                sc.may_writepage = 1;
                }
                if (nr_pages && to_free > total_reclaimed)
@@ -1084,6 +1087,7 @@
                /*
                 * OK, kswapd is getting into trouble.  Take a nap, then take
                 * another pass across the zones.
+                * NOTE: total_scanned is always zero.  See above.
                 */
                if (total_scanned && priority < DEF_PRIORITY - 2)
                        blk_congestion_wait(WRITE, HZ/10);



--Chuck Ebbert  06-Nov-04  23:35:37

