* [PATCH] kswapd shall not sleep during page shortage
@ 2004-11-09 16:46 Marcelo Tosatti
  2004-11-09 20:19 ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2004-11-09 16:46 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, Nick Piggin

Andrew,

I was wrong last time I read balance_pgdat() when I thought kswapd
couldn't sleep under page shortage.

It can, because all_zones_ok is set to "1" inside the
"priority=DEF_PRIORITY; priority >= 0; priority--" loop.

So this patch sets "all_zones_ok" to zero even if all_unreclaimable
is set, preventing kswapd from sleeping while zones are short of pages.
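
For readers without the source at hand, here is a minimal user-space sketch
of the first-pass zone check before and after the patch. It is a model, not
the kernel code: struct zone_model, the two *_would_sleep() helpers and the
precomputed watermark_ok flag are hypothetical names, DEF_PRIORITY is assumed
to be 12, and "goto scan" is reduced to "kswapd does not go back to sleep
from here".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEF_PRIORITY 12	/* assumed value, not taken from this thread */

/* Hypothetical stand-in for the few struct zone fields the check reads. */
struct zone_model {
	unsigned long present_pages;
	bool all_unreclaimable;
	bool watermark_ok;	/* result of zone_watermark_ok() at pages_high */
};

/* Pre-patch: unreclaimable zones are skipped before the watermark test. */
static bool prepatch_would_sleep(const struct zone_model *zones, int nr_zones,
				 int priority)
{
	assert(zones != NULL || nr_zones == 0);
	for (int i = nr_zones - 1; i >= 0; i--) {
		if (zones[i].present_pages == 0)
			continue;
		if (zones[i].all_unreclaimable && priority != DEF_PRIORITY)
			continue;
		if (!zones[i].watermark_ok)
			return false;	/* "goto scan" */
	}
	return true;	/* "goto out" with all_zones_ok still 1 */
}

/* Post-patch: the watermark test runs first and can clear all_zones_ok. */
static bool postpatch_would_sleep(const struct zone_model *zones, int nr_zones,
				  int priority)
{
	bool all_zones_ok = true;

	assert(zones != NULL || nr_zones == 0);
	for (int i = nr_zones - 1; i >= 0; i--) {
		if (zones[i].present_pages == 0)
			continue;
		if (!zones[i].watermark_ok)
			all_zones_ok = false;
		if (zones[i].all_unreclaimable && priority != DEF_PRIORITY)
			continue;
		return false;	/* "goto scan" */
	}
	/* "goto out": a cleared all_zones_ok leads to "goto loop_again". */
	return all_zones_ok;
}
```

With every zone unreclaimable and below pages_high, and priority below
DEF_PRIORITY, the pre-patch helper reports that kswapd goes to sleep while
the post-patch helper reports that it loops again.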

Please apply!


--- linux-2.6.10-rc1-mm2/mm/vmscan.c.orig	2004-11-09 16:38:04.480873424 -0200
+++ linux-2.6.10-rc1-mm2/mm/vmscan.c	2004-11-09 16:38:08.624243536 -0200
@@ -1033,15 +1033,17 @@
 				if (zone->present_pages == 0)
 					continue;
 
-				if (zone->all_unreclaimable &&
-						priority != DEF_PRIORITY)
-					continue;
-
 				if (!zone_watermark_ok(zone, order,
 						zone->pages_high, 0, 0, 0)) {
 					end_zone = i;
-					goto scan;
+					all_zones_ok = 0;
 				}
+
+				if (zone->all_unreclaimable &&
+						priority != DEF_PRIORITY)
+					continue;
+
+				goto scan;
 			}
 			goto out;
 		} else {
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"aart@kvack.org"> aart@kvack.org </a>


* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-09 20:19 ` Andrew Morton
@ 2004-11-09 17:41   ` Marcelo Tosatti
  2004-11-09 21:33     ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2004-11-09 17:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, piggin

On Tue, Nov 09, 2004 at 12:19:45PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > 
> > Andrew,
> > 
> > I was wrong last time I read balance_pgdat() when I thought kswapd
> > couldnt sleep under page shortage. 
> > 
> > It can, because all_zones_ok is set to "1" inside the 
> > "priority=DEF_PRIORITY; priority >= 0; priority--" loop.
> > 
> > So this patch sets "all_zones_ok" to zero even if all_unreclaimable 
> > is set, avoiding it from sleeping when zones are under page short.
> > 
> 
> Does this solve any observed problem?  What testing was done, and what were
> the results??


The observed problem is the page allocation failures!

No testing has been done, but it is an obvious problem if you read the
code.

What is your thinking?


> > --- linux-2.6.10-rc1-mm2/mm/vmscan.c.orig	2004-11-09 16:38:04.480873424 -0200
> > +++ linux-2.6.10-rc1-mm2/mm/vmscan.c	2004-11-09 16:38:08.624243536 -0200
> > @@ -1033,15 +1033,17 @@
> >  				if (zone->present_pages == 0)
> >  					continue;
> >  
> > -				if (zone->all_unreclaimable &&
> > -						priority != DEF_PRIORITY)
> > -					continue;
> > -
> >  				if (!zone_watermark_ok(zone, order,
> >  						zone->pages_high, 0, 0, 0)) {
> >  					end_zone = i;
> > -					goto scan;
> > +					all_zones_ok = 0;
> >  				}
> > +
> > +				if (zone->all_unreclaimable &&
> > +						priority != DEF_PRIORITY)
> > +					continue;
> > +
> > +				goto scan;
> >  			}
> >  			goto out;
> >  		} else {

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-09 21:33     ` Andrew Morton
@ 2004-11-09 18:26       ` Marcelo Tosatti
  2004-11-09 22:22         ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2004-11-09 18:26 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, piggin

On Tue, Nov 09, 2004 at 01:33:43PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > On Tue, Nov 09, 2004 at 12:19:45PM -0800, Andrew Morton wrote:
> > > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> > > >
> > > > 
> > > > Andrew,
> > > > 
> > > > I was wrong last time I read balance_pgdat() when I thought kswapd
> > > > couldnt sleep under page shortage. 
> > > > 
> > > > It can, because all_zones_ok is set to "1" inside the 
> > > > "priority=DEF_PRIORITY; priority >= 0; priority--" loop.
> > > > 
> > > > So this patch sets "all_zones_ok" to zero even if all_unreclaimable 
> > > > is set, avoiding it from sleeping when zones are under page short.
> > > > 
> > > 
> > > Does this solve any observed problem?  What testing was done, and what were
> > > the results??
> > 
> > 
> > The observed problem are the page allocation failures!
> 
> But the patch doesn't have any effect on that, which I can see.

Andrew, it prevents kswapd from sleeping when the machine is OOM.

> > No testing has been done, but it is an obvious problem if you read the
> > code. 
> 
> Not really.  The move of the all_unreclaimable test doesn't seem to do
> anything, because we'll just skip that zone anyway in the next loop.
> 
> Maybe you moved the all_unreclaimable test just so that there's an
> opportunity to clear all_zones_ok?  I dunno.

Yes, exactly. I moved the all_unreclaimable test because then there is
an opportunity to clear all_zones_ok. Otherwise all_zones_ok stays set
even if the zones are not OK at all!

> AFAICT, the early clearing of all_zones_ok will have no effect on kswapd
> throttling because the total_scanned logic is disabled.

It makes this code at the end of balance_pgdat() run:

        if (!all_zones_ok) {
                cond_resched();
                goto loop_again;
        }

> What I think your patch will do is to cause kswapd to do the `goto
> loop_again' thing if all zones are unreclaimable.  Which appears to risk
> putting kswapd into a busy loop when we're out of memory.

Yes, this is exactly what the patch does. 

And kswapd has to be in a busy loop when we're out of memory! It has
to be looking for free pages - it should not sleep, for God's sake!

Note that it won't cause excessive CPU usage because kswapd will be "polling"
slowly (with priority = DEF_PRIORITY) on the active/inactive lists (shrink_zone).

The cond_resched() at the end of balance_pgdat() makes sure kswapd does not
monopolize the CPU.

So this way it still does not cause the excessive CPU usage which is avoided by
all_unreclaimable (ie we won't be scanning huge amounts of pages at low priorities),
but at the same time it keeps kswapd from possibly sleeping, which IMO would be
very bad.

> So I'm all confused and concerned.  It would help if you were to explain
> your thinking more completely... 

I think now you can understand what I'm thinking.

Does it make sense to you?


* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-09 16:46 [PATCH] kswapd shall not sleep during page shortage Marcelo Tosatti
@ 2004-11-09 20:19 ` Andrew Morton
  2004-11-09 17:41   ` Marcelo Tosatti
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2004-11-09 20:19 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, piggin

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> 
> Andrew,
> 
> I was wrong last time I read balance_pgdat() when I thought kswapd
> couldnt sleep under page shortage. 
> 
> It can, because all_zones_ok is set to "1" inside the 
> "priority=DEF_PRIORITY; priority >= 0; priority--" loop.
> 
> So this patch sets "all_zones_ok" to zero even if all_unreclaimable 
> is set, avoiding it from sleeping when zones are under page short.
> 

Does this solve any observed problem?  What testing was done, and what were
the results??

> 
> --- linux-2.6.10-rc1-mm2/mm/vmscan.c.orig	2004-11-09 16:38:04.480873424 -0200
> +++ linux-2.6.10-rc1-mm2/mm/vmscan.c	2004-11-09 16:38:08.624243536 -0200
> @@ -1033,15 +1033,17 @@
>  				if (zone->present_pages == 0)
>  					continue;
>  
> -				if (zone->all_unreclaimable &&
> -						priority != DEF_PRIORITY)
> -					continue;
> -
>  				if (!zone_watermark_ok(zone, order,
>  						zone->pages_high, 0, 0, 0)) {
>  					end_zone = i;
> -					goto scan;
> +					all_zones_ok = 0;
>  				}
> +
> +				if (zone->all_unreclaimable &&
> +						priority != DEF_PRIORITY)
> +					continue;
> +
> +				goto scan;
>  			}
>  			goto out;
>  		} else {

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-09 22:22         ` Andrew Morton
@ 2004-11-09 20:31           ` Marcelo Tosatti
  2004-11-10  0:28             ` Andrew Morton
  2004-11-10  0:56           ` Nick Piggin
  1 sibling, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2004-11-09 20:31 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, piggin

On Tue, Nov 09, 2004 at 02:22:57PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > > 
> > > But the patch doesn't have any effect on that, which I can see.
> > 
> > Andrew, it avoids kswapd from sleeping when the machine is OOM.
> 
> I think you mean that it prevents balance_pgdat() from falling back to
> kswapd() when the machine is oom, yes?  There are other places where kswapd
> might sleep.

Yes, that is what I meant more precisely.

> > > > No testing has been done, but it is an obvious problem if you read the
> > > > code. 
> > > 
> > > Not really.  The move of the all_unreclaimable test doesn't seem to do
> > > anything, because we'll just skip that zone anyway in the next loop.
> > > 
> > > Maybe you moved the all_unreclaimable test just so that there's an
> > > opportunity to clear all_zones_ok?  I dunno.
> > 
> > Yes, exactly. I moved all_unreclaimable test because then there is 
> > an opportunity to clear all_zones_ok. Otherwise all_zones_ok keeps set
> > even if all_zones are not OK at all!
> 
> OK.
> 
> > > AFAICT, the early clearing of all_zones_ok will have no effect on kswapd
> > > throttling because the total_scanned logic is disabled.
> > 
> > It makes this at the end of balance_pgdat
> > 
> >         if (!all_zones_ok) {
> >                 cond_resched();
> >                 goto loop_again;
> >         }
> > 
> > happen.
> > 
> > > What I think your patch will do is to cause kswapd to do the `goto
> > > loop_again' thing if all zones are unreclaimable.  Which appears to risk
> > > putting kswapd into a busy loop when we're out of memory.
> > 
> > Yes, this is exactly what the patch does. 
> 
> OK.
> 
> > And kswapd has to be into a busy loop when we're out of memory! It has 
> > to be looking for free pages - it should not sleep for god sakes!
> 
> Why?  kswapd's functions are:
> 
> a) To perform scanning when direct-reclaim-capable processes are stuck
>    in disk wait (ie a bit of pipelining for CPU efficiency) and
> 
> b) To keep the free page pools full for interrupt-time allocators.
> 
> If kswapd cannot make forward progress it is quite acceptable to make it
> give up and go back to sleep.  Although it does seem better to keep kswapd
> running (but with throttling) in case there is disk writeout in flight. 

Yes, kswapd has to be "polling" from time to time looking for pages which might 
have become freeable. 

> > Note that it wont cause excessive CPU usage because kswapd will be "polling" 
> > slowly (with priority = DEF_PRIORITY) on the active/inactive lists (shrink_zone).
> > 
> > The cond_resched at the end of balance_pgdat() makes sure no harmful exclusivity
> > of CPU will happen.
> 
> Well it depends on task priorities.  But yeah, when the machine is this
> exhausted for memory we really don't care about CPU consumption.
> 
> > So this way it still does not cause the excessive CPU usage which is avoided by 
> > all_unreclaimable (ie we wont be scanning huge amounts of pages at low priorities)
> > but at the same time avoids kswapd from possibly sleeping, which is IMO
> > very bad.
> > 
> > > So I'm all confused and concerned.  It would help if you were to explain
> > > your thinking more completely... 
> > 
> > I think now you can understand what I'm thinking.
> 
> I do now.  Can we try to avoid the twenty-questions game next time??

I should have put in words what I had in mind in the first place, yes.

Will make sure to avoid it in future occasions.

> > Does it makes sense to you?
> 
> Maybe.  We really shouldn't be sending kswapd into a busy loop if all zones
> are unreclaimable.  Because it could just be that there's some disk I/O in
> flight and we'll find rotated reclaimable pages available once that I/O has
> completed.  (example: all of memory becomes dirty due to a large msync of
> MAP_SHARED memory).  So rather than madly scanning, we should throttle
> kswapd to make it wait for I/O completions.  Via blk_congestion_wait().
> That's what the total_scanned logic is supposed to do.

OK - I see your point - the best thing would be to have the IO completion 
routines (_end_io) asynchronously wake up kswapd.

Back to arguing in favour of my patch - it seemed to me that kswapd could 
go to sleep leaving allocators which can't reclaim pages themselves in a 
bad situation. 

It would have to be woken up by another instance of alloc_pages to then
execute and start doing its job, while if it was executing already (madly
scanning as you say), chances are it would find freeable pages much
earlier.

Note that not only disk IO can cause pages to become freeable. A user
can give up its reference on a pagecache page, for example (leaving
the page on the LRU to be found and freed by kswapd).

So the point was really "do not sleep if you can find freeable pages",
or in other words "it's not polling enough".

Testing such a modification would also show whether it indeed does what
I think it does, and what its real effects are.

I think I'll start yet another kernel tree "-mt".

Thanks for all your comments! 


* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-09 17:41   ` Marcelo Tosatti
@ 2004-11-09 21:33     ` Andrew Morton
  2004-11-09 18:26       ` Marcelo Tosatti
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2004-11-09 21:33 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, piggin

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> On Tue, Nov 09, 2004 at 12:19:45PM -0800, Andrew Morton wrote:
> > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> > >
> > > 
> > > Andrew,
> > > 
> > > I was wrong last time I read balance_pgdat() when I thought kswapd
> > > couldnt sleep under page shortage. 
> > > 
> > > It can, because all_zones_ok is set to "1" inside the 
> > > "priority=DEF_PRIORITY; priority >= 0; priority--" loop.
> > > 
> > > So this patch sets "all_zones_ok" to zero even if all_unreclaimable 
> > > is set, avoiding it from sleeping when zones are under page short.
> > > 
> > 
> > Does this solve any observed problem?  What testing was done, and what were
> > the results??
> 
> 
> The observed problem are the page allocation failures!

But the patch doesn't have any effect on that, which I can see.

> No testing has been done, but it is an obvious problem if you read the
> code. 

Not really.  The move of the all_unreclaimable test doesn't seem to do
anything, because we'll just skip that zone anyway in the next loop.

Maybe you moved the all_unreclaimable test just so that there's an
opportunity to clear all_zones_ok?  I dunno.

AFAICT, the early clearing of all_zones_ok will have no effect on kswapd
throttling because the total_scanned logic is disabled.

What I think your patch will do is to cause kswapd to do the `goto
loop_again' thing if all zones are unreclaimable.  Which appears to risk
putting kswapd into a busy loop when we're out of memory.

So I'm all confused and concerned.  It would help if you were to explain
your thinking more completely...

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-09 18:26       ` Marcelo Tosatti
@ 2004-11-09 22:22         ` Andrew Morton
  2004-11-09 20:31           ` Marcelo Tosatti
  2004-11-10  0:56           ` Nick Piggin
  0 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2004-11-09 22:22 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, piggin

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> > 
> > But the patch doesn't have any effect on that, which I can see.
> 
> Andrew, it avoids kswapd from sleeping when the machine is OOM.

I think you mean that it prevents balance_pgdat() from falling back to
kswapd() when the machine is oom, yes?  There are other places where kswapd
might sleep.


> > > No testing has been done, but it is an obvious problem if you read the
> > > code. 
> > 
> > Not really.  The move of the all_unreclaimable test doesn't seem to do
> > anything, because we'll just skip that zone anyway in the next loop.
> > 
> > Maybe you moved the all_unreclaimable test just so that there's an
> > opportunity to clear all_zones_ok?  I dunno.
> 
> Yes, exactly. I moved all_unreclaimable test because then there is 
> an opportunity to clear all_zones_ok. Otherwise all_zones_ok keeps set
> even if all_zones are not OK at all!

OK.

> > AFAICT, the early clearing of all_zones_ok will have no effect on kswapd
> > throttling because the total_scanned logic is disabled.
> 
> It makes this at the end of balance_pgdat
> 
>         if (!all_zones_ok) {
>                 cond_resched();
>                 goto loop_again;
>         }
> 
> happen.
> 
> > What I think your patch will do is to cause kswapd to do the `goto
> > loop_again' thing if all zones are unreclaimable.  Which appears to risk
> > putting kswapd into a busy loop when we're out of memory.
> 
> Yes, this is exactly what the patch does. 

OK.

> And kswapd has to be into a busy loop when we're out of memory! It has 
> to be looking for free pages - it should not sleep for god sakes!

Why?  kswapd's functions are:

a) To perform scanning when direct-reclaim-capable processes are stuck
   in disk wait (ie a bit of pipelining for CPU efficiency) and

b) To keep the free page pools full for interrupt-time allocators.

If kswapd cannot make forward progress it is quite acceptable to make it
give up and go back to sleep.  Although it does seem better to keep kswapd
running (but with throttling) in case there is disk writeout in flight.

> Note that it wont cause excessive CPU usage because kswapd will be "polling" 
> slowly (with priority = DEF_PRIORITY) on the active/inactive lists (shrink_zone).
> 
> The cond_resched at the end of balance_pgdat() makes sure no harmful exclusivity
> of CPU will happen.

Well it depends on task priorities.  But yeah, when the machine is this
exhausted for memory we really don't care about CPU consumption.

> So this way it still does not cause the excessive CPU usage which is avoided by 
> all_unreclaimable (ie we wont be scanning huge amounts of pages at low priorities)
> but at the same time avoids kswapd from possibly sleeping, which is IMO
> very bad.
> 
> > So I'm all confused and concerned.  It would help if you were to explain
> > your thinking more completely... 
> 
> I think now you can understand what I'm thinking.

I do now.  Can we try to avoid the twenty-questions game next time??

> Does it makes sense to you?

Maybe.  We really shouldn't be sending kswapd into a busy loop if all zones
are unreclaimable.  Because it could just be that there's some disk I/O in
flight and we'll find rotated reclaimable pages available once that I/O has
completed.  (example: all of memory becomes dirty due to a large msync of
MAP_SHARED memory).  So rather than madly scanning, we should throttle
kswapd to make it wait for I/O completions.  Via blk_congestion_wait(). 
That's what the total_scanned logic is supposed to do.
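
A toy model of the difference, under loud assumptions: time is counted in
abstract ticks, one scan pass costs one tick, and a throttled kswapd waits
wait_ticks after each fruitless pass (standing in for blk_congestion_wait(),
whose real behaviour is not modelled here). passes_before_io_done() is a
hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <limits.h>

/*
 * Count the scan passes kswapd makes before in-flight I/O completes at
 * io_done_tick. wait_ticks == 0 models the unthrottled busy loop.
 */
static unsigned long passes_before_io_done(unsigned long io_done_tick,
					   unsigned long wait_ticks)
{
	unsigned long step;

	assert(wait_ticks < ULONG_MAX);	/* 1 + wait_ticks must not wrap to 0 */
	step = 1 + wait_ticks;		/* one tick scanning plus the wait */

	/* Round up: a pass that starts before completion still counts. */
	return io_done_tick / step + (io_done_tick % step != 0);
}
```

In this model, I/O that completes after 100 ticks costs 100 fruitless passes
without throttling and 10 passes with a 9-tick wait per pass.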

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-10  0:28             ` Andrew Morton
@ 2004-11-09 23:16               ` Marcelo Tosatti
  2004-11-09 23:34                 ` Marcelo Tosatti
  2004-11-10  2:53                 ` Andrew Morton
  2004-11-10 18:14               ` Marcelo Tosatti
  1 sibling, 2 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2004-11-09 23:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, piggin

On Tue, Nov 09, 2004 at 04:28:01PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > Back to arguing in favour of my patch - it seemed to me that kswapd could 
> >  go to sleep leaving allocators which can't reclaim pages themselves in a 
> >  bad situation. 
> 
> Yes, but those processes would be sleeping in blk_congestion_wait() during,
> say, a GFP_NOIO/GFP_NOFS allocation attempt. 

I was thinking about interrupts when I mentioned "allocators which can't reclaim 
pages" :)

> And in that case, they may be
> holding locks which prevent kswapd from being able to do any work either.

OK... Just out of curiosity:
Isn't "lock contention" at this level (filesystem) a relatively rare situation?

Could it be an NFS lock, for example? What other kind of lock?

> >  It would have to be waken up by another instance of alloc_pages to then 
> >  execute and start doing its job, while if it was executing already (madly 
> >  scanning as you say), the chance it would find freeable pages quite
> >  earlier.
> > 
> >  Note that not only disk IO can cause pages to become freeable. A user
> >  can give up its reference on pagecache page for example (leaving
> >  the page on LRU to be found and freed by kswapd).
> 
> yup.  Or munlock(), or direct-io completion. 

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-09 23:16               ` Marcelo Tosatti
@ 2004-11-09 23:34                 ` Marcelo Tosatti
  2004-11-10  2:53                 ` Andrew Morton
  1 sibling, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2004-11-09 23:34 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, piggin

On Tue, Nov 09, 2004 at 09:16:54PM -0200, Marcelo Tosatti wrote:
> On Tue, Nov 09, 2004 at 04:28:01PM -0800, Andrew Morton wrote:
> > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> > >
> > > Back to arguing in favour of my patch - it seemed to me that kswapd could 
> > >  go to sleep leaving allocators which can't reclaim pages themselves in a 
> > >  bad situation. 
> > 
> > Yes, but those processes would be sleeping in blk_congestion_wait() during,
> > say, a GFP_NOIO/GFP_NOFS allocation attempt. 
> 
> I was thinking about interrupts when I mentioned "allocators which can't reclaim 
> pages" :)
> 
> > And in that case, they may be
> > holding locks whcih prevent kswapd from being able to do any work either.
> 
> OK... Just out of curiosity:
> Isnt the "lock contention" at this level (filesystem) a relatively rare situation? 
> 
> It could be a NFS lock for example? What other kind of lock?

Rather stupid question: these would be filesystem-internal locks like
i_sem, not NFS locks.

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-09 20:31           ` Marcelo Tosatti
@ 2004-11-10  0:28             ` Andrew Morton
  2004-11-09 23:16               ` Marcelo Tosatti
  2004-11-10 18:14               ` Marcelo Tosatti
  0 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2004-11-10  0:28 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, piggin

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> Back to arguing in favour of my patch - it seemed to me that kswapd could 
>  go to sleep leaving allocators which can't reclaim pages themselves in a 
>  bad situation. 

Yes, but those processes would be sleeping in blk_congestion_wait() during,
say, a GFP_NOIO/GFP_NOFS allocation attempt.  And in that case, they may be
holding locks which prevent kswapd from being able to do any work either.

>  It would have to be waken up by another instance of alloc_pages to then 
>  execute and start doing its job, while if it was executing already (madly 
>  scanning as you say), the chance it would find freeable pages quite
>  earlier.
> 
>  Note that not only disk IO can cause pages to become freeable. A user
>  can give up its reference on pagecache page for example (leaving
>  the page on LRU to be found and freed by kswapd).

yup.  Or munlock(), or direct-io completion.


* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-09 22:22         ` Andrew Morton
  2004-11-09 20:31           ` Marcelo Tosatti
@ 2004-11-10  0:56           ` Nick Piggin
  2004-11-10  2:49             ` Nick Piggin
  1 sibling, 1 reply; 21+ messages in thread
From: Nick Piggin @ 2004-11-10  0:56 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marcelo Tosatti, linux-mm


Andrew Morton wrote:

>Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
>
>>Does it makes sense to you?
>>
>
>Maybe.  We really shouldn't be sending kswapd into a busy loop if all zones
>are unreclaimable.  Because it could just be that there's some disk I/O in
>flight and we'll find rotated reclaimable pages available once that I/O has
>completed.  (example: all of memory becomes dirty due to a large msync of
>MAP_SHARED memory).  So rather than madly scanning, we should throttle
>kswapd to make it wait for I/O completions.  Via blk_congestion_wait(). 
>That's what the total_scanned logic is supposed to do.
>
>
>

I think the patch is possibly not a good idea. Unless it fixes up
those #*%! allocation failures (*).

For OOM conditions, kswapd can be a bit lax precisely because it
doesn't oom kill things. If there is a shortage, and kswapd can't
make progress though, I think it really should sleep rather than busy
wait (albeit nicely with cond_resched()).

(*) I'm beginning to think they're due to me accidentally bumping the
    page watermarks when 'fixing' them. I'll check that out presently.


* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-10  0:56           ` Nick Piggin
@ 2004-11-10  2:49             ` Nick Piggin
  2004-11-10  2:56               ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Nick Piggin @ 2004-11-10  2:49 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Marcelo Tosatti, linux-mm


Nick Piggin wrote:

>
> (*) I'm beginning to think they're due to me accidentally bumping the
>    page watermarks when 'fixing' them. I'll check that out presently.
>
>
That's basically it...

2.6.8 and 2.6.10-rc both have the same watermarks (in pages):

-----------
 From SysRq+M:

       pages_min   pages_low   pages_high
dma        4          8          12
normal   234        468         702
high     128        256         384

However, 2.6.10-rc has all 0's in its ->protection maps, while 2.6.8's look like:
       gfp_dma     gfp_normal  gfp_high
dma        8         476         732
normal     0         468         724
high       0         0           256

Because 2.6.8 basically keys the entire alloc_pages behaviour off the
->protection map (and look: the diagonal corresponds to pages_low for
each zone).
-----------


Following is the minimum free pages for each zone at which some action
will happen for order-0 (ZONE_NORMAL) allocations:

2.6.8
                             | GFP_KERNEL        | GFP_ATOMIC
allocate immediately         | 477 dma, 469 norm | 12 dma, 469 norm
allocate after waking kswapd | 477 dma, 469 norm | 12 dma, 352 norm
allocate after synch reclaim | 477 dma, 469 norm | n/a

2.6.10-rc
                             | GFP_KERNEL        | GFP_ATOMIC
allocate immediately         |   9 dma, 469 norm |  9 dma, 469 norm
allocate after waking kswapd |   5 dma, 234 norm |  3 dma,  88 norm
allocate after synch reclaim |   5 dma, 234 norm |  n/a

So the buffer between GFP_KERNEL and GFP_ATOMIC allocations is:

2.6.8      | 465 dma, 117 norm, 582 tot = 2328K
2.6.10-rc  |   2 dma, 146 norm, 148 tot =  592K

Although you can see that 2.6.10 theoretically has a much better layout
of numbers, and an increased ZONE_NORMAL buffer, 2.6.8's weird ZONE_DMA
handling gives it 4 times the amount of buffer between GFP_KERNEL and
GFP_ATOMIC allocations.
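
The buffer figures can be rechecked with a few lines of C. This is only a
sketch of the arithmetic: the helper names are hypothetical, the inputs
appear to be the "allocate after waking kswapd" row of each table, and a 4K
page size is assumed because it reproduces the quoted 2328K and 592K totals.

```c
#include <assert.h>

#define PAGE_SIZE_KB 4UL	/* assumption: 4K pages */

/* Pages a GFP_ATOMIC allocation may still take once GFP_KERNEL is refused. */
static unsigned long zone_buffer_pages(unsigned long gfp_kernel_floor,
				       unsigned long gfp_atomic_floor)
{
	assert(gfp_kernel_floor >= gfp_atomic_floor);
	return gfp_kernel_floor - gfp_atomic_floor;
}

/* Combined DMA + NORMAL buffer, converted from pages to kilobytes. */
static unsigned long buffer_kb(unsigned long dma_pages,
			       unsigned long normal_pages)
{
	return (dma_pages + normal_pages) * PAGE_SIZE_KB;
}
```

For 2.6.8 this gives 477 - 12 = 465 dma and 469 - 352 = 117 norm; for
2.6.10-rc it gives 5 - 3 = 2 dma and 234 - 88 = 146 norm.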

Shall we crank up min_free_kbytes a bit?

We could also compress the watermarks, while increasing pages_min? That
will increase the GFP_ATOMIC buffer as well, without having free memory
run away on us (eg pages_min = 2*x, pages_low = 5*x/2, pages_high = 3*x)?
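
To make the suggestion concrete, here is a small sketch comparing the two
spacings. The struct and function names are hypothetical; the "spaced"
variant (min = x, low = 2x, high = 3x) is only what the SysRq+M numbers
above imply, and integer division is assumed for 5*x/2.

```c
#include <assert.h>
#include <limits.h>

struct watermarks {
	unsigned long pages_min;
	unsigned long pages_low;
	unsigned long pages_high;
};

/* Spacing implied by the SysRq+M table: min = x, low = 2x, high = 3x. */
static struct watermarks spaced_watermarks(unsigned long x)
{
	assert(x <= ULONG_MAX / 5);	/* keep the multiplications in range */
	struct watermarks w = { x, 2 * x, 3 * x };
	return w;
}

/* Compressed spacing from the example: min = 2x, low = 5x/2, high = 3x. */
static struct watermarks compressed_watermarks(unsigned long x)
{
	assert(x <= ULONG_MAX / 5);	/* keep the multiplications in range */
	struct watermarks w = { 2 * x, 5 * x / 2, 3 * x };
	return w;
}
```

For ZONE_NORMAL (x = 234) the compressed variant would yield 468/585/702
instead of 234/468/702: pages_high stays put while pages_min doubles.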


* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-09 23:16               ` Marcelo Tosatti
  2004-11-09 23:34                 ` Marcelo Tosatti
@ 2004-11-10  2:53                 ` Andrew Morton
  1 sibling, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2004-11-10  2:53 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, piggin

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> 
> > And in that case, they may be
> > holding locks whcih prevent kswapd from being able to do any work either.
> 
> OK... Just out of curiosity:
> Isnt the "lock contention" at this level (filesystem) a relatively rare situation? 
> 

It should be relatively rare - most of the blocking opportunities on the
writepage path should be avoided by now - we bale out, try another page,
throttle on some I/O completion if it's all not working out.

There are probably filesystem allocation semaphores.  journal_start/stop
acts as a semaphore, as does get_request_wait().

It all depends on whether you're looking for common case or worst case.  In
page reclaim we should tune for the common case, and avoid deadlocking in
the worst case.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-10  2:49             ` Nick Piggin
@ 2004-11-10  2:56               ` Andrew Morton
  2004-11-10  3:12                 ` Nick Piggin
  0 siblings, 1 reply; 21+ messages in thread
From: Andrew Morton @ 2004-11-10  2:56 UTC (permalink / raw)
  To: Nick Piggin; +Cc: marcelo.tosatti, linux-mm

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> Shall we crank up min_free_kbytes a bit?

May as well.  Or we could do something fancy in register_netdevice().

>  We could also compress the watermarks, while increasing pages_min? That
>  will increase the GFP_ATOMIC buffer as well, without having free memory
>  run away on us (eg pages_min = 2*x, pages_low = 5*x/2, pages_high = 3*x)?

There are also hidden intermediate levels for rt-policy tasks.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-10  2:56               ` Andrew Morton
@ 2004-11-10  3:12                 ` Nick Piggin
  2004-11-10  3:18                   ` Andrew Morton
  0 siblings, 1 reply; 21+ messages in thread
From: Nick Piggin @ 2004-11-10  3:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: marcelo.tosatti, linux-mm


Andrew Morton wrote:

>Nick Piggin <piggin@cyberone.com.au> wrote:
>
>>Shall we crank up min_free_kbytes a bit?
>>
>
>May as well.  Or we could do something fancy in register_netdevice().
>
>

OK. If you look at my tables, in practice 2.6.8 will actually be
keeping more memory free anyway, in the form of ZONE_DMA free. So
we could quadruple min_free_kbytes, *but* 2.6.10's kswapd will
then free a lot further than 2.6.8.

So I'd advocate doubling min_free_kbytes, *and* squashing watermarks
together.


>> We could also compress the watermarks, while increasing pages_min? That
>> will increase the GFP_ATOMIC buffer as well, without having free memory
>> run away on us (eg pages_min = 2*x, pages_low = 5*x/2, pages_high = 3*x)?
>>
>
>There are also hidden intermediate levels for rt-policy tasks.
>
>
>

Yep, they all get keyed off pages_min - so if we just double pages_min,
we're effectively doubling that GFP_ATOMIC buffer and the rt_task
buffer(*), while halving the asynch reclaim marks (pages_low and
pages_high).

Now combine that with doubling min_free_kbytes, and we have our
quadrupled GFP_ATOMIC buffer, restoring parity with 2.6.8, while also
keeping the asynch reclaim marks in the same place. Make sense?

(*) The rt_task buffer was broken in 2.6.8 anyway because rt tasks could
allocate far more than GFP_ATOMIC allocations.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-10  3:12                 ` Nick Piggin
@ 2004-11-10  3:18                   ` Andrew Morton
  2004-11-10  3:27                     ` Nick Piggin
  2004-11-10  4:15                     ` Nick Piggin
  0 siblings, 2 replies; 21+ messages in thread
From: Andrew Morton @ 2004-11-10  3:18 UTC (permalink / raw)
  To: Nick Piggin; +Cc: marcelo.tosatti, linux-mm

Nick Piggin <piggin@cyberone.com.au> wrote:
>
> Make sense?

Hey, you know me - I'll believe anything.

Let's take a second look at the numbers when you have a patch.  Please
check that we're printing all the relevant info at boot time.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-10  3:18                   ` Andrew Morton
@ 2004-11-10  3:27                     ` Nick Piggin
  2004-11-10  4:15                     ` Nick Piggin
  1 sibling, 0 replies; 21+ messages in thread
From: Nick Piggin @ 2004-11-10  3:27 UTC (permalink / raw)
  To: Andrew Morton; +Cc: marcelo.tosatti, linux-mm


Andrew Morton wrote:

>Nick Piggin <piggin@cyberone.com.au> wrote:
>
>>Make sense?
>>
>
>Hey, you know me - I'll believe anything.
>
>Let's take a second look at the numbers when you have a patch.  Please
>check that we're printing all the relevant info at boot time.
>
>

OK... I'll actually go the other way - really quadruple min_free_kbytes,
and squash the top watermarks down, rather than the bottom ones up. This
way min_free_kbytes should retain its semantics. I've got a patch I'll
test now...

Also, what info do you want at boot time?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-10  3:18                   ` Andrew Morton
  2004-11-10  3:27                     ` Nick Piggin
@ 2004-11-10  4:15                     ` Nick Piggin
  2004-11-10  8:17                       ` Marcelo Tosatti
  1 sibling, 1 reply; 21+ messages in thread
From: Nick Piggin @ 2004-11-10  4:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: marcelo.tosatti, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1807 bytes --]



Andrew Morton wrote:

>Nick Piggin <piggin@cyberone.com.au> wrote:
>
>>Make sense?
>>
>
>Hey, you know me - I'll believe anything.
>
>Let's take a second look at the numbers when you have a patch.  Please
>check that we're printing all the relevant info at boot time.
>
>
>
>

OK, with this patch, this is what the situation looks like:

without patch:
      pages_min   pages_low   pages_high
dma        4          8          12
normal   234        468         702
high     128        256         384

with patch:
      pages_min   pages_low   pages_high
dma       17         21          25
normal   939       1173        1408
high     128        160         192

without patch:
                             | GFP_KERNEL        | GFP_ATOMIC
allocate immediately         |   9 dma, 469 norm |  9 dma, 469 norm
allocate after waking kswapd |   5 dma, 234 norm |  3 dma,  88 norm
allocate after synch reclaim |   5 dma, 234 norm |  n/a

with patch:
                             | GFP_KERNEL         | GFP_ATOMIC
allocate immediately         |  22 dma, 1174 norm | 22 dma, 1174 norm
allocate after waking kswapd |  18 dma,  940 norm |  6 dma,  440 norm
allocate after synch reclaim |  18 dma,  940 norm |  n/a


So the buffer between GFP_KERNEL and GFP_ATOMIC allocations is:

2.6.8      | 465 dma, 117 norm, 582 tot = 2328K
2.6.10-rc  |   2 dma, 146 norm, 148 tot =  592K
patch      |  12 dma, 500 norm, 512 tot = 2048K

Which is getting pretty good.


kswapd starts at:
2.6.8     477 dma, 496 norm, 973 total
2.6.10-rc   8 dma, 468 norm, 476 total
patched    17 dma, 939 norm, 956 total

So in terms of total pages, that's looking similar to 2.6.8.

I'd respectfully suggest the smaller buffer is a regression (versus 2.6.8,
at least), and hope this patch (or something like it) can get included in
2.6.10 after further testing?


[-- Attachment #2: mm-restore-atomic-buffer.patch --]
[-- Type: text/x-patch, Size: 2334 bytes --]




---

 linux-2.6-npiggin/mm/page_alloc.c |   41 +++++++++++++++++++++-----------------
 1 files changed, 23 insertions(+), 18 deletions(-)

diff -puN mm/page_alloc.c~mm-restore-atomic-buffer mm/page_alloc.c
--- linux-2.6/mm/page_alloc.c~mm-restore-atomic-buffer	2004-11-10 15:13:33.000000000 +1100
+++ linux-2.6-npiggin/mm/page_alloc.c	2004-11-10 14:57:54.000000000 +1100
@@ -1935,8 +1935,12 @@ static void setup_per_zone_pages_min(voi
 			                   lowmem_pages;
 		}
 
-		zone->pages_low = zone->pages_min * 2;
-		zone->pages_high = zone->pages_min * 3;
+		/*
+		 * When interpreting these watermarks, just keep in mind that:
+		 * zone->pages_min == (zone->pages_min * 4) / 4;
+		 */
+		zone->pages_low   = (zone->pages_min * 5) / 4;
+		zone->pages_high  = (zone->pages_min * 6) / 4;
 		spin_unlock_irqrestore(&zone->lru_lock, flags);
 	}
 }
@@ -1945,24 +1949,25 @@ static void setup_per_zone_pages_min(voi
  * Initialise min_free_kbytes.
  *
  * For small machines we want it small (128k min).  For large machines
- * we want it large (16MB max).  But it is not linear, because network
+ * we want it large (64MB max).  But it is not linear, because network
  * bandwidth does not increase linearly with machine size.  We use
  *
- *	min_free_kbytes = sqrt(lowmem_kbytes)
+ * 	min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
+ *	min_free_kbytes = sqrt(lowmem_kbytes * 16)
  *
  * which yields
  *
- * 16MB:	128k
- * 32MB:	181k
- * 64MB:	256k
- * 128MB:	362k
- * 256MB:	512k
- * 512MB:	724k
- * 1024MB:	1024k
- * 2048MB:	1448k
- * 4096MB:	2048k
- * 8192MB:	2896k
- * 16384MB:	4096k
+ * 16MB:	512k
+ * 32MB:	724k
+ * 64MB:	1024k
+ * 128MB:	1448k
+ * 256MB:	2048k
+ * 512MB:	2896k
+ * 1024MB:	4096k
+ * 2048MB:	5792k
+ * 4096MB:	8192k
+ * 8192MB:	11584k
+ * 16384MB:	16384k
  */
 static int __init init_per_zone_pages_min(void)
 {
@@ -1970,11 +1975,11 @@ static int __init init_per_zone_pages_mi
 
 	lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
 
-	min_free_kbytes = int_sqrt(lowmem_kbytes);
+	min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
 	if (min_free_kbytes < 128)
 		min_free_kbytes = 128;
-	if (min_free_kbytes > 16384)
-		min_free_kbytes = 16384;
+	if (min_free_kbytes > 65536)
+		min_free_kbytes = 65536;
 	setup_per_zone_pages_min();
 	setup_per_zone_protection();
 	return 0;

_

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-10  4:15                     ` Nick Piggin
@ 2004-11-10  8:17                       ` Marcelo Tosatti
  0 siblings, 0 replies; 21+ messages in thread
From: Marcelo Tosatti @ 2004-11-10  8:17 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-mm

On Wed, Nov 10, 2004 at 03:15:53PM +1100, Nick Piggin wrote:
> 
> 
> Andrew Morton wrote:
> 
> >Nick Piggin <piggin@cyberone.com.au> wrote:
> >
> >>Make sense?
> >>
> >
> >Hey, you know me - I'll believe anything.
> >
> >Let's take a second look at the numbers when you have a patch.  Please
> >check that we're printing all the relevant info at boot time.
> >
> >
> >
> >
> 
> OK, with this patch, this is what the situation looks like:
> 
> without patch:
>       pages_min   pages_low   pages_high
> dma        4          8          12
> normal   234        468         702
> high     128        256         384
> 
> with patch:
>       pages_min   pages_low   pages_high
> dma       17         21          25
> normal   939       1173        1408
> high     128        160         192
> 
> without patch:
>                              | GFP_KERNEL        | GFP_ATOMIC
> allocate immediately         |   9 dma, 469 norm |  9 dma, 469 norm
> allocate after waking kswapd |   5 dma, 234 norm |  3 dma,  88 norm
> allocate after synch reclaim |   5 dma, 234 norm |  n/a
> 
> with patch:
>                              | GFP_KERNEL         | GFP_ATOMIC
> allocate immediately         |  22 dma, 1174 norm | 22 dma, 1174 norm
> allocate after waking kswapd |  18 dma,  940 norm |  6 dma,  440 norm
> allocate after synch reclaim |  18 dma,  940 norm |  n/a
> 
> 
> So the buffer between GFP_KERNEL and GFP_ATOMIC allocations is:
> 
> 2.6.8      | 465 dma, 117 norm, 582 tot = 2328K
> 2.6.10-rc  |   2 dma, 146 norm, 148 tot =  592K
> patch      |  12 dma, 500 norm, 512 tot = 2048K
> 
> Which is getting pretty good.
> 
> 
> kswapd starts at:
> 2.6.8     477 dma, 496 norm, 973 total
> 2.6.10-rc   8 dma, 468 norm, 476 total
> patched    17 dma, 939 norm, 956 total
> 
> So in terms of total pages, that's looking similar to 2.6.8.
> 
> I'd respectfully suggest this is a regression (versus 2.6.8, at least),
> and hope it (or something like it) can get included in 2.6.10 after further
> testing?

That looks much better, thanks Nick!

I bet we won't be seeing the failures so often with this in place.
Nice analysis.

>  linux-2.6-npiggin/mm/page_alloc.c |   41 +++++++++++++++++++++-----------------
>  1 files changed, 23 insertions(+), 18 deletions(-)
> 
> diff -puN mm/page_alloc.c~mm-restore-atomic-buffer mm/page_alloc.c
> --- linux-2.6/mm/page_alloc.c~mm-restore-atomic-buffer	2004-11-10 15:13:33.000000000 +1100
> +++ linux-2.6-npiggin/mm/page_alloc.c	2004-11-10 14:57:54.000000000 +1100
> @@ -1935,8 +1935,12 @@ static void setup_per_zone_pages_min(voi
>  			                   lowmem_pages;
>  		}
>  
> -		zone->pages_low = zone->pages_min * 2;
> -		zone->pages_high = zone->pages_min * 3;
> +		/*
> +		 * When interpreting these watermarks, just keep in mind that:
> +		 * zone->pages_min == (zone->pages_min * 4) / 4;
> +		 */
> +		zone->pages_low   = (zone->pages_min * 5) / 4;
> +		zone->pages_high  = (zone->pages_min * 6) / 4;
>  		spin_unlock_irqrestore(&zone->lru_lock, flags);
>  	}
>  }
> @@ -1945,24 +1949,25 @@ static void setup_per_zone_pages_min(voi
>   * Initialise min_free_kbytes.
>   *
>   * For small machines we want it small (128k min).  For large machines
> - * we want it large (16MB max).  But it is not linear, because network
> + * we want it large (64MB max).  But it is not linear, because network
>   * bandwidth does not increase linearly with machine size.  We use
>   *
> - *	min_free_kbytes = sqrt(lowmem_kbytes)
> + * 	min_free_kbytes = 4 * sqrt(lowmem_kbytes), for better accuracy:
> + *	min_free_kbytes = sqrt(lowmem_kbytes * 16)
>   *
>   * which yields
>   *
> - * 16MB:	128k
> - * 32MB:	181k
> - * 64MB:	256k
> - * 128MB:	362k
> - * 256MB:	512k
> - * 512MB:	724k
> - * 1024MB:	1024k
> - * 2048MB:	1448k
> - * 4096MB:	2048k
> - * 8192MB:	2896k
> - * 16384MB:	4096k
> + * 16MB:	512k
> + * 32MB:	724k
> + * 64MB:	1024k
> + * 128MB:	1448k
> + * 256MB:	2048k
> + * 512MB:	2896k
> + * 1024MB:	4096k
> + * 2048MB:	5792k
> + * 4096MB:	8192k
> + * 8192MB:	11584k
> + * 16384MB:	16384k
>   */
>  static int __init init_per_zone_pages_min(void)
>  {
> @@ -1970,11 +1975,11 @@ static int __init init_per_zone_pages_mi
>  
>  	lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
>  
> -	min_free_kbytes = int_sqrt(lowmem_kbytes);
> +	min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
>  	if (min_free_kbytes < 128)
>  		min_free_kbytes = 128;
> -	if (min_free_kbytes > 16384)
> -		min_free_kbytes = 16384;
> +	if (min_free_kbytes > 65536)
> +		min_free_kbytes = 65536;
>  	setup_per_zone_pages_min();
>  	setup_per_zone_protection();
>  	return 0;
> 
> _


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-10  0:28             ` Andrew Morton
  2004-11-09 23:16               ` Marcelo Tosatti
@ 2004-11-10 18:14               ` Marcelo Tosatti
  2004-11-10 22:08                 ` Andrew Morton
  1 sibling, 1 reply; 21+ messages in thread
From: Marcelo Tosatti @ 2004-11-10 18:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, piggin

On Tue, Nov 09, 2004 at 04:28:01PM -0800, Andrew Morton wrote:
> Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> >
> > Back to arguing in favour of my patch - it seemed to me that kswapd could 
> >  go to sleep leaving allocators which can't reclaim pages themselves in a 
> >  bad situation. 
> 
> Yes, but those processes would be sleeping in blk_congestion_wait() during,
> say, a GFP_NOIO/GFP_NOFS allocation attempt.  And in that case, they may be
> holding locks which prevent kswapd from being able to do any work either.
> 
> >  It would have to be woken up by another instance of alloc_pages to then 
> >  execute and start doing its job, while if it was executing already (madly 
> >  scanning as you say), there is a chance it would find freeable pages much
> >  earlier.
> > 
> >  Note that not only disk IO can cause pages to become freeable. A user
> >  can give up its reference on pagecache page for example (leaving
> >  the page on LRU to be found and freed by kswapd).
> 
> yup.  Or munlock(), or direct-io completion.

Andrew,

Shouldn't the kernel ideally clear zone->all_unreclaimable in those 
situations? (munlock, direct-io completion, last reference on pagecache
page, etc).

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH] kswapd shall not sleep during page shortage
  2004-11-10 18:14               ` Marcelo Tosatti
@ 2004-11-10 22:08                 ` Andrew Morton
  0 siblings, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2004-11-10 22:08 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-mm, piggin

Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
>
> On Tue, Nov 09, 2004 at 04:28:01PM -0800, Andrew Morton wrote:
> > Marcelo Tosatti <marcelo.tosatti@cyclades.com> wrote:
> > >
> > > Back to arguing in favour of my patch - it seemed to me that kswapd could 
> > >  go to sleep leaving allocators which can't reclaim pages themselves in a 
> > >  bad situation. 
> > 
> > Yes, but those processes would be sleeping in blk_congestion_wait() during,
> > say, a GFP_NOIO/GFP_NOFS allocation attempt.  And in that case, they may be
> > holding locks which prevent kswapd from being able to do any work either.
> > 
> > >  It would have to be woken up by another instance of alloc_pages to then 
> > >  execute and start doing its job, while if it was executing already (madly 
> > >  scanning as you say), there is a chance it would find freeable pages much
> > >  earlier.
> > > 
> > >  Note that not only disk IO can cause pages to become freeable. A user
> > >  can give up its reference on pagecache page for example (leaving
> > >  the page on LRU to be found and freed by kswapd).
> > 
> > yup.  Or munlock(), or direct-io completion.
> 
> Andrew,
> 
> Shouldn't the kernel ideally clear zone->all_unreclaimable in those 
> situations? (munlock, direct-io completion, last reference on pagecache
> page, etc).

The design intent here is that a zone shouldn't enter the all-unreclaimable
state until we've absolutely scanned the crap out of it.  So we assume that
once a zone is all-unreclaimable then it will stay that way for a
relatively long time.  We do little, short scans just to poll the status of
the zone.  If one of those short scans ends up freeing a page then the zone
is removed from the all_unreclaimable state.

So if someone does one of the above things then we hope that a subsequent
short-scan will free a page and will wake the zone up.  This has the obvious
drawback that it might take us a number of scanning passes before we
discover a reclaimable page.   1<<DEF_PRIORITY passes, worst-case.

For munlock we'd need to actually examine the zone of each affected page,
which is a bunch of new code - a full pte walk.  We don't want munlocks of
ZONE_HIGHMEM to trigger these huge scans of a lower zone.

We could possibly put special-case code in the direct-io completion
handler, but it's all a bit weird.

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2004-11-10 22:08 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-11-09 16:46 [PATCH] kswapd shall not sleep during page shortage Marcelo Tosatti
2004-11-09 20:19 ` Andrew Morton
2004-11-09 17:41   ` Marcelo Tosatti
2004-11-09 21:33     ` Andrew Morton
2004-11-09 18:26       ` Marcelo Tosatti
2004-11-09 22:22         ` Andrew Morton
2004-11-09 20:31           ` Marcelo Tosatti
2004-11-10  0:28             ` Andrew Morton
2004-11-09 23:16               ` Marcelo Tosatti
2004-11-09 23:34                 ` Marcelo Tosatti
2004-11-10  2:53                 ` Andrew Morton
2004-11-10 18:14               ` Marcelo Tosatti
2004-11-10 22:08                 ` Andrew Morton
2004-11-10  0:56           ` Nick Piggin
2004-11-10  2:49             ` Nick Piggin
2004-11-10  2:56               ` Andrew Morton
2004-11-10  3:12                 ` Nick Piggin
2004-11-10  3:18                   ` Andrew Morton
2004-11-10  3:27                     ` Nick Piggin
2004-11-10  4:15                     ` Nick Piggin
2004-11-10  8:17                       ` Marcelo Tosatti