[patch] mm: fix deferred congestion timeout if preferred zone is not allowed

All of lore.kernel.org
 help / color / mirror / Atom feed

* [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
@ 2011-01-18  5:09 David Rientjes
  2011-01-18  6:04 ` KOSAKI Motohiro
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: David Rientjes @ 2011-01-18  5:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Minchan Kim, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm

Before 0e093d99763e (writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone), preferred_zone was only used for
statistics and to determine the zoneidx from which to allocate from given
the type requested.

wait_iff_congested(), though, uses preferred_zone to determine if the
congestion wait should be deferred because its dirty pages are backed by
a congested bdi.  This incorrectly defers the timeout and busy loops in
the page allocator with various cond_resched() calls if preferred_zone is
not allowed in the current context, usually consuming 100% of a cpu.

This patch resets preferred_zone to an allowed zone in the slowpath if
the allocation context is constrained by current's cpuset.  It also
ensures preferred_zone is from the set of allowed nodes when called from
within direct reclaim; allocations are always constrainted by cpusets
since the context is always blockable.

Both of these uses of cpuset_current_mems_allowed are protected by
get_mems_allowed().
---
 mm/page_alloc.c |   12 ++++++++++++
 mm/vmscan.c     |    3 ++-
 2 files changed, 14 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2034,6 +2034,18 @@ restart:
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);

+	/*
+	 * If preferred_zone cannot be allocated from in this context, find the
+	 * first allowable zone instead.
+	 */
+	if ((alloc_flags & ALLOC_CPUSET) &&
+	    !cpuset_zone_allowed_softwall(preferred_zone, gfp_mask)) {
+		first_zones_zonelist(zonelist, high_zoneidx,
+				&cpuset_current_mems_allowed, &preferred_zone);
+		if (unlikely(!preferred_zone))
+			goto nopage;
+	}
+
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2084,7 +2084,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 			struct zone *preferred_zone;

 			first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
-							NULL, &preferred_zone);
+						&cpuset_current_mems_allowed,
+						&preferred_zone);
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);
 		}
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18  5:09 [patch] mm: fix deferred congestion timeout if preferred zone is not allowed David Rientjes
@ 2011-01-18  6:04 ` KOSAKI Motohiro
  2011-01-18 10:29   ` Mel Gorman
  2011-01-18 10:15 ` Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: KOSAKI Motohiro @ 2011-01-18  6:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Mel Gorman, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, Rik van Riel,
	Jens Axboe, linux-mm

Hi,

> Before 0e093d99763e (writeback: do not sleep on the congestion queue if
> there are no congested BDIs or if significant congestion is not being
> encountered in the current zone), preferred_zone was only used for
> statistics and to determine the zoneidx from which to allocate from given
> the type requested.

True. So, following comment is now bogus. ;)

__alloc_pages_nodemask()
{
(snip)
        get_mems_allowed();
        /* The preferred zone is used for statistics later */
        first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
        if (!preferred_zone) {
                put_mems_allowed();
                return NULL;
        }


Now, we have three preferred_zone usage.
 1. for zone stat
 2. wait_iff_congested
 3. for calculate compaction duration

So, I have two question.  

1. Why do we need different vm stat policy mempolicy and cpuset? 
That said, if we are using mempolicy, the above nodemask variable is 
not NULL, then preferrd_zone doesn't point nearest zone. But it point 
always nearest zone when cpuset are used. 

2. Why wait_iff_congested in page_alloc only wait preferred zone? 
That said, theorically, any no congested zones in allocatable zones can
avoid waiting. Just code simplify?

3. I'm not sure why zone->compact_defer is not noted per zone, instead
noted only preferred zone. Do you know the intention?

I mean my first feeling tell me that we have a chance to make code simplify
more.

Mel, Can you please tell us your opinion?



> wait_iff_congested(), though, uses preferred_zone to determine if the
> congestion wait should be deferred because its dirty pages are backed by
> a congested bdi.  This incorrectly defers the timeout and busy loops in
> the page allocator with various cond_resched() calls if preferred_zone is
> not allowed in the current context, usually consuming 100% of a cpu.
> 
> This patch resets preferred_zone to an allowed zone in the slowpath if
> the allocation context is constrained by current's cpuset.  It also
> ensures preferred_zone is from the set of allowed nodes when called from
> within direct reclaim; allocations are always constrainted by cpusets
> since the context is always blockable.
> 
> Both of these uses of cpuset_current_mems_allowed are protected by
> get_mems_allowed().


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18  5:09 [patch] mm: fix deferred congestion timeout if preferred zone is not allowed David Rientjes
  2011-01-18  6:04 ` KOSAKI Motohiro
@ 2011-01-18 10:15 ` Mel Gorman
  2011-01-18 20:24   ` David Rientjes
  2011-01-19 12:52   ` KOSAKI Motohiro
  2011-01-19  0:43 ` Minchan Kim
  2011-01-23 22:30 ` [patch v2] " David Rientjes
  3 siblings, 2 replies; 21+ messages in thread
From: Mel Gorman @ 2011-01-18 10:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm

On Mon, Jan 17, 2011 at 09:09:07PM -0800, David Rientjes wrote:
> Before 0e093d99763e (writeback: do not sleep on the congestion queue if
> there are no congested BDIs or if significant congestion is not being
> encountered in the current zone), preferred_zone was only used for
> statistics and to determine the zoneidx from which to allocate from given
> the type requested.
> 

It is also used when deciding if compaction should be deferred for a
period of time.

> wait_iff_congested(), though, uses preferred_zone to determine if the
> congestion wait should be deferred because its dirty pages are backed by
> a congested bdi.  This incorrectly defers the timeout and busy loops in
> the page allocator with various cond_resched() calls if preferred_zone is
> not allowed in the current context, usually consuming 100% of a cpu.
> 

The current context being cpuset context or do you have other situations
in mind?

> This patch resets preferred_zone to an allowed zone in the slowpath if
> the allocation context is constrained by current's cpuset. 

Well, preferred_zone has meaning. If it's not possible to allocate from
that zone in the current cpuset context, it's not really preferred. Why
not set it in the fast path so there isn't a useless call to
get_page_from_freelist()?

> It also
> ensures preferred_zone is from the set of allowed nodes when called from
> within direct reclaim; allocations are always constrainted by cpusets
> since the context is always blockable.
> 

preferred_zone should already be obeying nodemask and the set of allowed
nodes. Are you aware of an instance where this is not the case or are
you talking about the nodes allowed by the cpuset?

> Both of these uses of cpuset_current_mems_allowed are protected by
> get_mems_allowed().
> ---
>  mm/page_alloc.c |   12 ++++++++++++
>  mm/vmscan.c     |    3 ++-
>  2 files changed, 14 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2034,6 +2034,18 @@ restart:
>  	 */
>  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
> +	/*
> +	 * If preferred_zone cannot be allocated from in this context, find the
> +	 * first allowable zone instead.
> +	 */
> +	if ((alloc_flags & ALLOC_CPUSET) &&
> +	    !cpuset_zone_allowed_softwall(preferred_zone, gfp_mask)) {
> +		first_zones_zonelist(zonelist, high_zoneidx,
> +				&cpuset_current_mems_allowed, &preferred_zone);
> +		if (unlikely(!preferred_zone))
> +			goto nopage;
> +	}
> +

This looks as if it would work but is there any reason why
cpuset_current_mems_allowed is not used as the nodemask for ALLOC_CPUSET? It's
used by ZLC with CONFIG_NUMA machines for example so it seems a little
inconsistent. If a nodemask was supplied by the caller, it could be AND'd
with cpuset_current_mems_allowed.

>  	/* This is the last chance, in general, before the goto nopage. */
>  	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
>  			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2084,7 +2084,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
>  			struct zone *preferred_zone;
>  
>  			first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
> -							NULL, &preferred_zone);
> +						&cpuset_current_mems_allowed,
> +						&preferred_zone);
>  			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);

This part looks fine and a worthwhile fix all on its own.

>  		}
>  	}
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18  6:04 ` KOSAKI Motohiro
@ 2011-01-18 10:29   ` Mel Gorman
  2011-01-19 12:48     ` KOSAKI Motohiro
  0 siblings, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2011-01-18 10:29 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: David Rientjes, Andrew Morton, Johannes Weiner, Minchan Kim,
	Wu Fengguang, KAMEZAWA Hiroyuki, Rik van Riel, Jens Axboe,
	linux-mm

On Tue, Jan 18, 2011 at 03:04:21PM +0900, KOSAKI Motohiro wrote:
> Hi,
> 
> > Before 0e093d99763e (writeback: do not sleep on the congestion queue if
> > there are no congested BDIs or if significant congestion is not being
> > encountered in the current zone), preferred_zone was only used for
> > statistics and to determine the zoneidx from which to allocate from given
> > the type requested.
> 
> True. So, following comment is now bogus. ;)
> 
> __alloc_pages_nodemask()
> {
> (snip)
>         get_mems_allowed();
>         /* The preferred zone is used for statistics later */
>         first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
>         if (!preferred_zone) {
>                 put_mems_allowed();
>                 return NULL;
>         }
> 

At best, it's misleading. It is used for staistics later but it's not
all it's used for.

> 
> Now, we have three preferred_zone usage.
>  1. for zone stat
>  2. wait_iff_congested
>  3. for calculate compaction duration
> 

For 3, it is used to determine if compaction should be deferred. I'm not
sure what it has to do with the duration of compaction.

> So, I have two question.  
> 

three questions :)

> 1. Why do we need different vm stat policy mempolicy and cpuset? 
> That said, if we are using mempolicy, the above nodemask variable is 
> not NULL, then preferrd_zone doesn't point nearest zone. But it point 
> always nearest zone when cpuset are used. 
> 

I think this is historical. cpuset and mempolicy were introduced at
different times and never merged together as they should have been. I
think an attempt was made a very long time ago but there was significant
resistance from SGI developers who didn't want to see regressions
introduced in a feature they depended heavily on.

> 2. Why wait_iff_congested in page_alloc only wait preferred zone? 
> That said, theorically, any no congested zones in allocatable zones can
> avoid waiting. Just code simplify?
> 

The ideal for page allocation is that the preferred zone is always used.
If it is congested, it's probable that significant pressure also exists on
the other zones in the zonelist (because an allocation attempt failed)
but if the preferred zone is uncongested, we should try reclaiming from
it rather than going to sleep.

> 3. I'm not sure why zone->compact_defer is not noted per zone, instead
> noted only preferred zone. Do you know the intention?
> 

If we are deferring compaction, we have failed to compact the preferred
zone and all other zones in the zonelist. Updating the preferred zone is
sufficient for future allocations of the same type. We could update all
zones in the zonelist but it's unnecessary overhead and gains very little.

> I mean my first feeling tell me that we have a chance to make code simplify
> more.
> 
> Mel, Can you please tell us your opinion?
> 

Right now, I'm thinking that cpuset_current_mems_allowed should be used
as a nodemask earlier so that preferred_zone gets initialised as a
sensible value early on.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18 10:15 ` Mel Gorman
@ 2011-01-18 20:24   ` David Rientjes
  2011-01-18 20:42     ` Mel Gorman
  2011-01-19 13:01     ` KOSAKI Motohiro
  2011-01-19 12:52   ` KOSAKI Motohiro
  1 sibling, 2 replies; 21+ messages in thread
From: David Rientjes @ 2011-01-18 20:24 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm

On Tue, 18 Jan 2011, Mel Gorman wrote:

> > wait_iff_congested(), though, uses preferred_zone to determine if the
> > congestion wait should be deferred because its dirty pages are backed by
> > a congested bdi.  This incorrectly defers the timeout and busy loops in
> > the page allocator with various cond_resched() calls if preferred_zone is
> > not allowed in the current context, usually consuming 100% of a cpu.
> > 
> 
> The current context being cpuset context or do you have other situations
> in mind?
> 

Only cpuset context will restrict certain nodes from being allocated from, 
mempolicies pass the allowed mask into the page allocator already.

> > This patch resets preferred_zone to an allowed zone in the slowpath if
> > the allocation context is constrained by current's cpuset. 
> 
> Well, preferred_zone has meaning. If it's not possible to allocate from
> that zone in the current cpuset context, it's not really preferred. Why
> not set it in the fast path so there isn't a useless call to
> get_page_from_freelist()?
> 

It may be the preferred zone even if it isn't allowed by current's cpuset 
such as if the allocation is __GFP_WAIT or the task has been oom killed 
and has the TIF_MEMDIE bit set, so the preferred zone in the fastpath is 
accurate in these cases.  In the slowpath, the former is protected by 
checking for ALLOC_CPUSET and the latter is usually only set after the 
page allocator has looped at least once and triggered the oom killer to be 
killed.

I didn't want to add a branch to test for these possibilities in the 
fastpath, however, since preferred_zone isn't of critical importance until 
it's used in the slowpath (ignoring the statistical usage).

> > It also
> > ensures preferred_zone is from the set of allowed nodes when called from
> > within direct reclaim; allocations are always constrainted by cpusets
> > since the context is always blockable.
> > 
> 
> preferred_zone should already be obeying nodemask and the set of allowed
> nodes. Are you aware of an instance where this is not the case or are
> you talking about the nodes allowed by the cpuset?
> 

In the direct reclaim path, the fix is to make sure preferred_zone is 
allowed by cpuset_current_mems_allowed since we don't need to test for 
__GFP_WAIT: it's useless to check the congestion of a zone that cannot be 
allocated from.

> > Both of these uses of cpuset_current_mems_allowed are protected by
> > get_mems_allowed().
> > ---
> >  mm/page_alloc.c |   12 ++++++++++++
> >  mm/vmscan.c     |    3 ++-
> >  2 files changed, 14 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2034,6 +2034,18 @@ restart:
> >  	 */
> >  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
> >  
> > +	/*
> > +	 * If preferred_zone cannot be allocated from in this context, find the
> > +	 * first allowable zone instead.
> > +	 */
> > +	if ((alloc_flags & ALLOC_CPUSET) &&
> > +	    !cpuset_zone_allowed_softwall(preferred_zone, gfp_mask)) {
> > +		first_zones_zonelist(zonelist, high_zoneidx,
> > +				&cpuset_current_mems_allowed, &preferred_zone);
> > +		if (unlikely(!preferred_zone))
> > +			goto nopage;
> > +	}
> > +
> 
> This looks as if it would work but is there any reason why
> cpuset_current_mems_allowed is not used as the nodemask for ALLOC_CPUSET? It's
> used by ZLC with CONFIG_NUMA machines for example so it seems a little
> inconsistent. If a nodemask was supplied by the caller, it could be AND'd
> with cpuset_current_mems_allowed.
> 

ALLOC_CPUSET is checked in get_page_from_freelist() because there are 
exceptions allowed both by cpuset_zone_allowed_softwall() based on the 
state of the task and by not setting ALLOC_CPUSET in the page allocator 
based on !__GFP_WAIT.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18 20:24   ` David Rientjes
@ 2011-01-18 20:42     ` Mel Gorman
  2011-01-19  1:51       ` David Rientjes
  2011-01-19 13:01     ` KOSAKI Motohiro
  1 sibling, 1 reply; 21+ messages in thread
From: Mel Gorman @ 2011-01-18 20:42 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm

On Tue, Jan 18, 2011 at 12:24:26PM -0800, David Rientjes wrote:
> On Tue, 18 Jan 2011, Mel Gorman wrote:
> 
> > > wait_iff_congested(), though, uses preferred_zone to determine if the
> > > congestion wait should be deferred because its dirty pages are backed by
> > > a congested bdi.  This incorrectly defers the timeout and busy loops in
> > > the page allocator with various cond_resched() calls if preferred_zone is
> > > not allowed in the current context, usually consuming 100% of a cpu.
> > > 
> > 
> > The current context being cpuset context or do you have other situations
> > in mind?
> > 
> 
> Only cpuset context will restrict certain nodes from being allocated from, 
> mempolicies pass the allowed mask into the page allocator already.
> 
> > > This patch resets preferred_zone to an allowed zone in the slowpath if
> > > the allocation context is constrained by current's cpuset. 
> > 
> > Well, preferred_zone has meaning. If it's not possible to allocate from
> > that zone in the current cpuset context, it's not really preferred. Why
> > not set it in the fast path so there isn't a useless call to
> > get_page_from_freelist()?
> > 
> 
> It may be the preferred zone even if it isn't allowed by current's cpuset 
> such as if the allocation is __GFP_WAIT or the task has been oom killed 
> and has the TIF_MEMDIE bit set, so the preferred zone in the fastpath is 
> accurate in these cases.  In the slowpath, the former is protected by 
> checking for ALLOC_CPUSET and the latter is usually only set after the 
> page allocator has looped at least once and triggered the oom killer to be 
> killed.
> 

Ok, this is reasonable and is a notable distinction from nodemasks. It's
worth including this in the changelog.

> I didn't want to add a branch to test for these possibilities in the 
> fastpath, however, since preferred_zone isn't of critical importance until 
> it's used in the slowpath (ignoring the statistical usage).
> 

With these two paragraphs included in the changelog;

Acked-by: Mel Gorman <mel@csn.ul.ie>

Thanks.

> > > It also
> > > ensures preferred_zone is from the set of allowed nodes when called from
> > > within direct reclaim; allocations are always constrainted by cpusets
> > > since the context is always blockable.
> > > 
> > 
> > preferred_zone should already be obeying nodemask and the set of allowed
> > nodes. Are you aware of an instance where this is not the case or are
> > you talking about the nodes allowed by the cpuset?
> > 
> 
> In the direct reclaim path, the fix is to make sure preferred_zone is 
> allowed by cpuset_current_mems_allowed since we don't need to test for 
> __GFP_WAIT: it's useless to check the congestion of a zone that cannot be 
> allocated from.
> 
> > > Both of these uses of cpuset_current_mems_allowed are protected by
> > > get_mems_allowed().
> > > ---
> > >  mm/page_alloc.c |   12 ++++++++++++
> > >  mm/vmscan.c     |    3 ++-
> > >  2 files changed, 14 insertions(+), 1 deletions(-)
> > > 
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2034,6 +2034,18 @@ restart:
> > >  	 */
> > >  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
> > >  
> > > +	/*
> > > +	 * If preferred_zone cannot be allocated from in this context, find the
> > > +	 * first allowable zone instead.
> > > +	 */
> > > +	if ((alloc_flags & ALLOC_CPUSET) &&
> > > +	    !cpuset_zone_allowed_softwall(preferred_zone, gfp_mask)) {
> > > +		first_zones_zonelist(zonelist, high_zoneidx,
> > > +				&cpuset_current_mems_allowed, &preferred_zone);
> > > +		if (unlikely(!preferred_zone))
> > > +			goto nopage;
> > > +	}
> > > +
> > 
> > This looks as if it would work but is there any reason why
> > cpuset_current_mems_allowed is not used as the nodemask for ALLOC_CPUSET? It's
> > used by ZLC with CONFIG_NUMA machines for example so it seems a little
> > inconsistent. If a nodemask was supplied by the caller, it could be AND'd
> > with cpuset_current_mems_allowed.
> > 
> 
> ALLOC_CPUSET is checked in get_page_from_freelist() because there are 
> exceptions allowed both by cpuset_zone_allowed_softwall() based on the 
> state of the task and by not setting ALLOC_CPUSET in the page allocator 
> based on !__GFP_WAIT.
> 

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18  5:09 [patch] mm: fix deferred congestion timeout if preferred zone is not allowed David Rientjes
  2011-01-18  6:04 ` KOSAKI Motohiro
  2011-01-18 10:15 ` Mel Gorman
@ 2011-01-19  0:43 ` Minchan Kim
  2011-01-19  1:53   ` David Rientjes
  2011-01-23 22:30 ` [patch v2] " David Rientjes
  3 siblings, 1 reply; 21+ messages in thread
From: Minchan Kim @ 2011-01-19  0:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm, Christoph Lameter

Hi David,

On Tue, Jan 18, 2011 at 2:09 PM, David Rientjes <rientjes@google.com> wrote:
> Before 0e093d99763e (writeback: do not sleep on the congestion queue if
> there are no congested BDIs or if significant congestion is not being
> encountered in the current zone), preferred_zone was only used for
> statistics and to determine the zoneidx from which to allocate from given
> the type requested.
>
> wait_iff_congested(), though, uses preferred_zone to determine if the
> congestion wait should be deferred because its dirty pages are backed by
> a congested bdi.  This incorrectly defers the timeout and busy loops in
> the page allocator with various cond_resched() calls if preferred_zone is
> not allowed in the current context, usually consuming 100% of a cpu.
>
> This patch resets preferred_zone to an allowed zone in the slowpath if
> the allocation context is constrained by current's cpuset.  It also
> ensures preferred_zone is from the set of allowed nodes when called from
> within direct reclaim; allocations are always constrainted by cpusets
> since the context is always blockable.
>
> Both of these uses of cpuset_current_mems_allowed are protected by
> get_mems_allowed().
> ---
>  mm/page_alloc.c |   12 ++++++++++++
>  mm/vmscan.c     |    3 ++-
>  2 files changed, 14 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2034,6 +2034,18 @@ restart:
>         */
>        alloc_flags = gfp_to_alloc_flags(gfp_mask);
>
> +       /*
> +        * If preferred_zone cannot be allocated from in this context, find the
> +        * first allowable zone instead.
> +        */
> +       if ((alloc_flags & ALLOC_CPUSET) &&
> +           !cpuset_zone_allowed_softwall(preferred_zone, gfp_mask)) {
> +               first_zones_zonelist(zonelist, high_zoneidx,
> +                               &cpuset_current_mems_allowed, &preferred_zone);

This patch is one we need. but I have a nitpick.
I am not familiar with CPUSET so I might be wrong.

I think it could make side effect of statistics of ZVM on
buffered_rmqueue since you intercept and change preferred_zone.
It could make NUMA_HIT instead of NUMA_MISS.
Is it your intention?




-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18 20:42     ` Mel Gorman
@ 2011-01-19  1:51       ` David Rientjes
  0 siblings, 0 replies; 21+ messages in thread
From: David Rientjes @ 2011-01-19  1:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Johannes Weiner, Minchan Kim, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm

On Tue, 18 Jan 2011, Mel Gorman wrote:

> > It may be the preferred zone even if it isn't allowed by current's cpuset 
> > such as if the allocation is __GFP_WAIT or the task has been oom killed 
> > and has the TIF_MEMDIE bit set, so the preferred zone in the fastpath is 
> > accurate in these cases.  In the slowpath, the former is protected by 
> > checking for ALLOC_CPUSET and the latter is usually only set after the 
> > page allocator has looped at least once and triggered the oom killer to be 
> > killed.
> > 
> 
> Ok, this is reasonable and is a notable distinction from nodemasks. It's
> worth including this in the changelog.
> 

Agreed, and please do s/__GFP_WAIT/!__GFP_WAIT/ for it, too :)

> > I didn't want to add a branch to test for these possibilities in the 
> > fastpath, however, since preferred_zone isn't of critical importance until 
> > it's used in the slowpath (ignoring the statistical usage).
> > 
> 
> With these two paragraphs included in the changelog;
> 
> Acked-by: Mel Gorman <mel@csn.ul.ie>
> 

Thanks, and as Andrew noted:

Signed-off-by: David Rientjes <rientjes@google.com>

I'd also suggest this for the -stable tree for 2.6.37.x.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-19  0:43 ` Minchan Kim
@ 2011-01-19  1:53   ` David Rientjes
  2011-01-19  4:10     ` Minchan Kim
  2011-01-19 19:59     ` Christoph Lameter
  0 siblings, 2 replies; 21+ messages in thread
From: David Rientjes @ 2011-01-19  1:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm, Christoph Lameter

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1187 bytes --]

On Wed, 19 Jan 2011, Minchan Kim wrote:

> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2034,6 +2034,18 @@ restart:
> >         */
> >        alloc_flags = gfp_to_alloc_flags(gfp_mask);
> >
> > +       /*
> > +        * If preferred_zone cannot be allocated from in this context, find the
> > +        * first allowable zone instead.
> > +        */
> > +       if ((alloc_flags & ALLOC_CPUSET) &&
> > +           !cpuset_zone_allowed_softwall(preferred_zone, gfp_mask)) {
> > +               first_zones_zonelist(zonelist, high_zoneidx,
> > +                               &cpuset_current_mems_allowed, &preferred_zone);
> 
> This patch is one we need. but I have a nitpick.
> I am not familiar with CPUSET so I might be wrong.
> 
> I think it could make side effect of statistics of ZVM on
> buffered_rmqueue since you intercept and change preferred_zone.
> It could make NUMA_HIT instead of NUMA_MISS.
> Is it your intention?
> 

It depends on the semantics of NUMA_MISS: if no local nodes are allowed by 
current's cpuset (a pretty poor cpuset config :), then it seems logical 
that all allocations would be a miss.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-19  1:53   ` David Rientjes
@ 2011-01-19  4:10     ` Minchan Kim
  2011-01-19 19:59     ` Christoph Lameter
  1 sibling, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2011-01-19  4:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm, Christoph Lameter

On Wed, Jan 19, 2011 at 10:53 AM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 19 Jan 2011, Minchan Kim wrote:
>
>> >
>> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> > --- a/mm/page_alloc.c
>> > +++ b/mm/page_alloc.c
>> > @@ -2034,6 +2034,18 @@ restart:
>> >         */
>> >        alloc_flags = gfp_to_alloc_flags(gfp_mask);
>> >
>> > +       /*
>> > +        * If preferred_zone cannot be allocated from in this context, find the
>> > +        * first allowable zone instead.
>> > +        */
>> > +       if ((alloc_flags & ALLOC_CPUSET) &&
>> > +           !cpuset_zone_allowed_softwall(preferred_zone, gfp_mask)) {
>> > +               first_zones_zonelist(zonelist, high_zoneidx,
>> > +                               &cpuset_current_mems_allowed, &preferred_zone);
>>
>> This patch is one we need. but I have a nitpick.
>> I am not familiar with CPUSET so I might be wrong.
>>
>> I think it could make side effect of statistics of ZVM on
>> buffered_rmqueue since you intercept and change preferred_zone.
>> It could make NUMA_HIT instead of NUMA_MISS.
>> Is it your intention?
>>
>
> It depends on the semantics of NUMA_MISS: if no local nodes are allowed by
> current's cpuset (a pretty poor cpuset config :), then it seems logical
> that all allocations would be a miss.

It does make sense to me but I delegate the decision to Christoph who is author.
And please write down this behavior change into changelog. :)
Thanks David.



-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18 10:29   ` Mel Gorman
@ 2011-01-19 12:48     ` KOSAKI Motohiro
  0 siblings, 0 replies; 21+ messages in thread
From: KOSAKI Motohiro @ 2011-01-19 12:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, Rik van Riel,
	Jens Axboe, linux-mm

> > Now, we have three preferred_zone usage.
> >  1. for zone stat
> >  2. wait_iff_congested
> >  3. for calculate compaction duration
> > 
> 
> For 3, it is used to determine if compaction should be deferred. I'm not
> sure what it has to do with the duration of compaction.
> 
> > So, I have two question.  
> > 
> 
> three questions :)

Hehe, Maybe the yellow monkey can't count number rather than two. I bet. ;-p

> > 1. Why do we need different vm stat policy mempolicy and cpuset? 
> > That said, if we are using mempolicy, the above nodemask variable is 
> > not NULL, then preferrd_zone doesn't point nearest zone. But it point 
> > always nearest zone when cpuset are used. 
> 
> I think this is historical. cpuset and mempolicy were introduced at
> different times and never merged together as they should have been. I
> think an attempt was made a very long time ago but there was significant
> resistance from SGI developers who didn't want to see regressions
> introduced in a feature they depended heavily on.

Yup, I think so too.
And as David said, NUMA stat is not so important stastics. Probably we can concentrate
usage (2).


> 
> > 2. Why wait_iff_congested in page_alloc only wait preferred zone? 
> > That said, theorically, any no congested zones in allocatable zones can
> > avoid waiting. Just code simplify?
> > 
> 
> The ideal for page allocation is that the preferred zone is always used.

Yup, really.

However, now we are discussing reclaim and allocationo retry path. It's slightly different
allocation fast path.

> If it is congested, it's probable that significant pressure also exists on
> the other zones in the zonelist (because an allocation attempt failed)

Hmm..
Why do we need to guess it? It is in allocation retrying path, IOW it's after try_to_free_pages(),
scanning zonelist is not so heavy impact operation.


> but if the preferred zone is uncongested, we should try reclaiming from
> it rather than going to sleep.

But if it's congested, the task will be stucked in vmscan anyway.
Can you please show your worried scenario?

> > 3. I'm not sure why zone->compact_defer is not noted per zone, instead
> > noted only preferred zone. Do you know the intention?
> 
> If we are deferring compaction, we have failed to compact the preferred
> zone and all other zones in the zonelist. Updating the preferred zone is
> sufficient for future allocations of the same type. We could update all
> zones in the zonelist but it's unnecessary overhead and gains very little.

Ok, the requirement is to note one of zones, not list head zone of zonelist.


> > I mean my first feeling tell me that we have a chance to make code simplify
> > more.
> > 
> > Mel, Can you please tell us your opinion?
> > 
> 
> Right now, I'm thinking that cpuset_current_mems_allowed should be used
> as a nodemask earlier so that preferred_zone gets initialised as a
> sensible value early on.





--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18 10:15 ` Mel Gorman
  2011-01-18 20:24   ` David Rientjes
@ 2011-01-19 12:52   ` KOSAKI Motohiro
  1 sibling, 0 replies; 21+ messages in thread
From: KOSAKI Motohiro @ 2011-01-19 12:52 UTC (permalink / raw)
  To: Mel Gorman
  Cc: kosaki.motohiro, David Rientjes, Andrew Morton, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, Rik van Riel,
	Jens Axboe, linux-mm

> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2084,7 +2084,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
> >  			struct zone *preferred_zone;
> >  
> >  			first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
> > -							NULL, &preferred_zone);
> > +						&cpuset_current_mems_allowed,
> > +						&preferred_zone);
> >  			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);
> 
> This part looks fine and a worthwhile fix all on its own.

No. Memcg reclaim should be cared cpuset-wall. Please look at shrink_zones().
And, Now I don't think checking only preferred zone is good idea. zone congestion
makes pageout() failure and makes lots lru rotation than necessary.

Following patch care all related zones, but keep sleep 0.1 seconds at maximum.

---
 mm/vmscan.c |   37 +++++++++++++++++++++++++++----------
 1 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 55f5c0e..6b453d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1868,6 +1868,7 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 {
 	struct zoneref *z;
 	struct zone *zone;
+	long timeout = HZ/10;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -1886,6 +1887,32 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 
 		shrink_zone(priority, zone, sc);
 	}
+
+	/* No heavy pressure. */
+	if (priority >= DEF_PRIORITY - 2)
+		return;
+
+	/* Obviously we didn't issue IO. */
+	if (sc->nr_scanned == 0)
+		return;
+
+	/* Other tasks are freezed. IO congestion is no matter. */
+	if (sc->hibernation_mode)
+		return;
+
+	/* Take a nap, wait for some writeback to complete */
+	for_each_zone_zonelist_nodemask(zone, z, zonelist,
+					gfp_zone(sc->gfp_mask), sc->nodemask) {
+		if (!populated_zone(zone))
+			continue;
+		if (scanning_global_lru(sc) &&
+		    !cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
+			continue;
+
+		timeout = wait_iff_congested(zone, BLK_RW_ASYNC, timeout);
+		if (!timeout)
+			break;
+	}
 }
 
 static bool zone_reclaimable(struct zone *zone)
@@ -1993,16 +2020,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
 			sc->may_writepage = 1;
 		}
-
-		/* Take a nap, wait for some writeback to complete */
-		if (!sc->hibernation_mode && sc->nr_scanned &&
-		    priority < DEF_PRIORITY - 2) {
-			struct zone *preferred_zone;
-
-			first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
-							NULL, &preferred_zone);
-			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);
-		}
 	}
 
 out:
-- 
1.6.5.2



--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18 20:24   ` David Rientjes
  2011-01-18 20:42     ` Mel Gorman
@ 2011-01-19 13:01     ` KOSAKI Motohiro
  2011-01-19 18:37       ` David Rientjes
  1 sibling, 1 reply; 21+ messages in thread
From: KOSAKI Motohiro @ 2011-01-19 13:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Mel Gorman, Andrew Morton, Johannes Weiner,
	Minchan Kim, Wu Fengguang, KAMEZAWA Hiroyuki, Rik van Riel,
	Jens Axboe, linux-mm

> > > This patch resets preferred_zone to an allowed zone in the slowpath if
> > > the allocation context is constrained by current's cpuset. 
> > 
> > Well, preferred_zone has meaning. If it's not possible to allocate from
> > that zone in the current cpuset context, it's not really preferred. Why
> > not set it in the fast path so there isn't a useless call to
> > get_page_from_freelist()?
> > 
> 
> It may be the preferred zone even if it isn't allowed by current's cpuset 
> such as if the allocation is __GFP_WAIT or the task has been oom killed 
> and has the TIF_MEMDIE bit set, so the preferred zone in the fastpath is 
> accurate in these cases.  In the slowpath, the former is protected by 
> checking for ALLOC_CPUSET and the latter is usually only set after the 
> page allocator has looped at least once and triggered the oom killer to be 
> killed.
> 
> I didn't want to add a branch to test for these possibilities in the 
> fastpath, however, since preferred_zone isn't of critical importance until 
> it's used in the slowpath (ignoring the statistical usage).

I'm glad to you are keeping fastpath concern. However you don't need
nodemask-and in this case. Because zonelist->zref[0] is always in nodemask.
Please see policy_zonelist(). So, you can just replace nodemask with cpuset_mems_allowed.

This is not only simple, but also improve a consisteny of mempolicy.

---
 mm/page_alloc.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 07a6544..876de04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2146,7 +2146,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 
 	get_mems_allowed();
 	/* The preferred zone is used for statistics later */
-	first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
+	first_zones_zonelist(zonelist, high_zoneidx,
+			     &cpuset_current_mems_allowed, &preferred_zone);
 	if (!preferred_zone) {
 		put_mems_allowed();
 		return NULL;
-- 
1.6.5.2




--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-19 13:01     ` KOSAKI Motohiro
@ 2011-01-19 18:37       ` David Rientjes
  0 siblings, 0 replies; 21+ messages in thread
From: David Rientjes @ 2011-01-19 18:37 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Mel Gorman, Andrew Morton, Johannes Weiner, Minchan Kim,
	Wu Fengguang, KAMEZAWA Hiroyuki, Rik van Riel, Jens Axboe,
	linux-mm

On Wed, 19 Jan 2011, KOSAKI Motohiro wrote:

> I'm glad to you are keeping fastpath concern. However you don't need
> nodemask-and in this case. Because zonelist->zref[0] is always in nodemask.
> Please see policy_zonelist(). So, you can just replace nodemask with cpuset_mems_allowed.
> 
> This is not only simple, but also improve a consisteny of mempolicy.
> 

mempolicies have nothing to do with this, they pass their nodemask into 
the page allocator so the preferred_zone is already allowed; setting a 
mempolicy with a nodemask that is disallowed by the cpuset is an invalid 
configuration.

> ---
>  mm/page_alloc.c |    3 ++-
>  1 files changed, 2 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 07a6544..876de04 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2146,7 +2146,8 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
>  
>  	get_mems_allowed();
>  	/* The preferred zone is used for statistics later */
> -	first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
> +	first_zones_zonelist(zonelist, high_zoneidx,
> +			     &cpuset_current_mems_allowed, &preferred_zone);
>  	if (!preferred_zone) {
>  		put_mems_allowed();
>  		return NULL;

As previously mentioned, I didn't want to affect the current behavior of 
mempolicies when they pass their own nodemask into the page allocator that 
may be a subset of the set of allowed nodes; in that case, the statistics 
are probably actually important and we can defer resetting preferred_zone 
to the slowpath where we know its a __GFP_WAIT allocation instead of the 
first try in the fastpath.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-19  1:53   ` David Rientjes
  2011-01-19  4:10     ` Minchan Kim
@ 2011-01-19 19:59     ` Christoph Lameter
  2011-01-19 20:06       ` Andi Kleen
  1 sibling, 1 reply; 21+ messages in thread
From: Christoph Lameter @ 2011-01-19 19:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, Andrew Morton, Mel Gorman, Johannes Weiner,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel,
	Jens Axboe, linux-mm, andi

On Tue, 18 Jan 2011, David Rientjes wrote:

> It depends on the semantics of NUMA_MISS: if no local nodes are allowed by
> current's cpuset (a pretty poor cpuset config :), then it seems logical
> that all allocations would be a miss.

NUMA_MISS is defined as an allocations that did not succeed on the node
the allocation was "intended" for. So far "intended" as been interpreted
as allocations that are either intended for the closest numa node or the
preferred node. One could say that the cpuset config is an "intention".

Andi?


See man numastat


NAME
       numastat - Print statistics about NUMA memory allocation

SYNOPSIS
       numastat

DESCRIPTION
       numastat  displays  NUMA allocations statistics from the kernel
memory allocator.  Each process has NUMA policies that specifies on which
node pages are allocated.
       See set_mempolicy(2) or numactl(8) on details of the available
policies.  The numastat counters keep track on what nodes memory is
finally allocated.

       The counters are separated for each node. Each count event is the
allocation of a page of memory.

       numa_hit is the number of allocations where an allocation was
intended for that node and succeeded there.

       numa_miss shows how often an allocation was intended for this node,
but ended up on another node due to low memory.

       numa_foreign is the number of allocations that were intended for
another node, but ended up on this node.  Each numa_foreign event has a
numa_miss on another node.

       interleave_hit is the number of interleave policy allocations that
were intended for a specific node and succeeded there.

       local_node is incremented when a process running on the node
allocated memory on the same node.

       other_node is incremented when a process running on another node
allocated memory on that node.

SEE ALSO
       numactl(8) set_mempolicy(2) numa(3)

NOTES
       numastat output is only available on NUMA systems.

       numastat assumes the output terminal has a width of 80 characters
and tries to format the output accordingly.

EXAMPLES
       watch -n1 numastat
       watch -n1 --differences=accumulative numastat

FILES
       /sys/devices/system/node/node*/numastat


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-19 19:59     ` Christoph Lameter
@ 2011-01-19 20:06       ` Andi Kleen
  2011-01-19 20:18         ` David Rientjes
  0 siblings, 1 reply; 21+ messages in thread
From: Andi Kleen @ 2011-01-19 20:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Minchan Kim, Andrew Morton, Mel Gorman,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Jens Axboe, linux-mm, andi

On Wed, Jan 19, 2011 at 01:59:01PM -0600, Christoph Lameter wrote:
> On Tue, 18 Jan 2011, David Rientjes wrote:
> 
> > It depends on the semantics of NUMA_MISS: if no local nodes are allowed by
> > current's cpuset (a pretty poor cpuset config :), then it seems logical
> > that all allocations would be a miss.
> 
> NUMA_MISS is defined as an allocations that did not succeed on the node
> the allocation was "intended" for. So far "intended" as been interpreted
> as allocations that are either intended for the closest numa node or the
> preferred node. One could say that the cpuset config is an "intention".
> 
> Andi?

cpusets didn't exist when I designed that. But the idea was that
the kernel has a first choice ("hit") and any other node is a "miss"
that may need investigation.  So yes I would consider cpuset config as an 
intention too and should be counted as hit/miss.

-Andi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-19 20:06       ` Andi Kleen
@ 2011-01-19 20:18         ` David Rientjes
  2011-01-19 23:07           ` Christoph Lameter
  2011-01-20  0:59           ` Minchan Kim
  0 siblings, 2 replies; 21+ messages in thread
From: David Rientjes @ 2011-01-19 20:18 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Minchan Kim, Andrew Morton, Mel Gorman,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Jens Axboe, linux-mm

On Wed, 19 Jan 2011, Andi Kleen wrote:

> cpusets didn't exist when I designed that. But the idea was that
> the kernel has a first choice ("hit") and any other node is a "miss"
> that may need investigation.  So yes I would consider cpuset config as an 
> intention too and should be counted as hit/miss.
> 

Ok, so there's no additional modification that needs to be made with the 
patch (other than perhaps some more descriptive documentation of a 
NUMA_HIT and NUMA_MISS).  When the kernel passes all zones into the page 
allocator, it's relying on cpusets to reduce that zonelist to only 
allowable nodes by using ALLOC_CPUSET.  If we can allocate from the first 
zone allowed by the cpuset, it will be treated as a hit; otherwise, it 
will be treated as a miss.  That's better than treating everything as a 
miss when the cpuset doesn't include the first node.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-19 20:18         ` David Rientjes
@ 2011-01-19 23:07           ` Christoph Lameter
  2011-01-20  0:59           ` Minchan Kim
  1 sibling, 0 replies; 21+ messages in thread
From: Christoph Lameter @ 2011-01-19 23:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andi Kleen, Minchan Kim, Andrew Morton, Mel Gorman,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Jens Axboe, linux-mm

On Wed, 19 Jan 2011, David Rientjes wrote:

> On Wed, 19 Jan 2011, Andi Kleen wrote:
>
> > cpusets didn't exist when I designed that. But the idea was that
> > the kernel has a first choice ("hit") and any other node is a "miss"
> > that may need investigation.  So yes I would consider cpuset config as an
> > intention too and should be counted as hit/miss.
> >
>
> Ok, so there's no additional modification that needs to be made with the
> patch (other than perhaps some more descriptive documentation of a
> NUMA_HIT and NUMA_MISS).  When the kernel passes all zones into the page
> allocator, it's relying on cpusets to reduce that zonelist to only
> allowable nodes by using ALLOC_CPUSET.  If we can allocate from the first
> zone allowed by the cpuset, it will be treated as a hit; otherwise, it
> will be treated as a miss.  That's better than treating everything as a
> miss when the cpuset doesn't include the first node.

To be more specific: It is the first zone of the zonelist that the cpuset
context provided for allocation from the node that the process is
currently executing on.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-19 20:18         ` David Rientjes
  2011-01-19 23:07           ` Christoph Lameter
@ 2011-01-20  0:59           ` Minchan Kim
  1 sibling, 0 replies; 21+ messages in thread
From: Minchan Kim @ 2011-01-20  0:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andi Kleen, Christoph Lameter, Andrew Morton, Mel Gorman,
	Johannes Weiner, Wu Fengguang, KAMEZAWA Hiroyuki,
	KOSAKI Motohiro, Rik van Riel, Jens Axboe, linux-mm

On Thu, Jan 20, 2011 at 5:18 AM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 19 Jan 2011, Andi Kleen wrote:
>
>> cpusets didn't exist when I designed that. But the idea was that
>> the kernel has a first choice ("hit") and any other node is a "miss"
>> that may need investigation.  So yes I would consider cpuset config as an
>> intention too and should be counted as hit/miss.
>>
>
> Ok, so there's no additional modification that needs to be made with the
> patch (other than perhaps some more descriptive documentation of a
> NUMA_HIT and NUMA_MISS).  When the kernel passes all zones into the page
> allocator, it's relying on cpusets to reduce that zonelist to only
> allowable nodes by using ALLOC_CPUSET.  If we can allocate from the first
> zone allowed by the cpuset, it will be treated as a hit; otherwise, it
> will be treated as a miss.  That's better than treating everything as a
> miss when the cpuset doesn't include the first node.
>

Thanks for the care on this issue, David, Christoph, Andi.
Looks good to me.
Feel free to add my Reviewed-by.

Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [patch v2] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-18  5:09 [patch] mm: fix deferred congestion timeout if preferred zone is not allowed David Rientjes
                   ` (2 preceding siblings ...)
  2011-01-19  0:43 ` Minchan Kim
@ 2011-01-23 22:30 ` David Rientjes
  2011-01-24 17:16   ` Rik van Riel
  3 siblings, 1 reply; 21+ messages in thread
From: David Rientjes @ 2011-01-23 22:30 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mel Gorman, Johannes Weiner, Minchan Kim, Wu Fengguang,
	KAMEZAWA Hiroyuki, KOSAKI Motohiro, Rik van Riel, Jens Axboe,
	linux-mm

Before 0e093d99763e (writeback: do not sleep on the congestion queue if
there are no congested BDIs or if significant congestion is not being
encountered in the current zone), preferred_zone was only used for
NUMA statistics, to determine the zoneidx from which to allocate from given
the type requested, and whether to utilize memory compaction.

wait_iff_congested(), though, uses preferred_zone to determine if the
congestion wait should be deferred because its dirty pages are backed by
a congested bdi.  This incorrectly defers the timeout and busy loops in
the page allocator with various cond_resched() calls if preferred_zone is
not allowed in the current context, usually consuming 100% of a cpu.

This patch ensures preferred_zone is an allowed zone in the fastpath
depending on whether current is constrained by its cpuset or nodes in
its mempolicy (when the nodemask passed is non-NULL).  This is correct
since the fastpath allocation always passes ALLOC_CPUSET when trying
to allocate memory.  In the slowpath, this patch resets preferred_zone
to the first zone of the allowed type when the allocation is not
constrained by current's cpuset, i.e. it does not pass ALLOC_CPUSET.

This patch also ensures preferred_zone is from the set of allowed nodes
when called from within direct reclaim since allocations are always
constrained by cpusets in this context (it is blockable).

Both of these uses of cpuset_current_mems_allowed are protected by
get_mems_allowed().

Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/page_alloc.c |   12 +++++++++++-
 mm/vmscan.c     |    3 ++-
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2034,6 +2034,14 @@ restart:
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);

+	/*
+	 * Find the true preferred zone if the allocation is unconstrained by
+	 * cpusets.
+	 */
+	if (!(alloc_flags & ALLOC_CPUSET) && !nodemask)
+		first_zones_zonelist(zonelist, high_zoneidx, NULL,
+					&preferred_zone);
+
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, nodemask, order, zonelist,
 			high_zoneidx, alloc_flags & ~ALLOC_NO_WATERMARKS,
@@ -2192,7 +2200,9 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,

 	get_mems_allowed();
 	/* The preferred zone is used for statistics later */
-	first_zones_zonelist(zonelist, high_zoneidx, nodemask, &preferred_zone);
+	first_zones_zonelist(zonelist, high_zoneidx,
+				nodemask ? : &cpuset_current_mems_allowed,
+				&preferred_zone);
 	if (!preferred_zone) {
 		put_mems_allowed();
 		return NULL;
diff --git a/mm/vmscan.c b/mm/vmscan.c
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2083,7 +2083,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 			struct zone *preferred_zone;

 			first_zones_zonelist(zonelist, gfp_zone(sc->gfp_mask),
-							NULL, &preferred_zone);
+						&cpuset_current_mems_allowed,
+						&preferred_zone);
 			wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/10);
 		}
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [patch v2] mm: fix deferred congestion timeout if preferred zone is not allowed
  2011-01-23 22:30 ` [patch v2] " David Rientjes
@ 2011-01-24 17:16   ` Rik van Riel
  0 siblings, 0 replies; 21+ messages in thread
From: Rik van Riel @ 2011-01-24 17:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Mel Gorman, Johannes Weiner, Minchan Kim,
	Wu Fengguang, KAMEZAWA Hiroyuki, KOSAKI Motohiro, Jens Axboe,
	linux-mm

On 01/23/2011 05:30 PM, David Rientjes wrote:

> This patch also ensures preferred_zone is from the set of allowed nodes
> when called from within direct reclaim since allocations are always
> constrained by cpusets in this context (it is blockable).
>
> Both of these uses of cpuset_current_mems_allowed are protected by
> get_mems_allowed().
>
> Signed-off-by: David Rientjes<rientjes@google.com>

Acked-by: Rik van Riel <riel@redhat.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2011-01-24 17:17 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-01-18  5:09 [patch] mm: fix deferred congestion timeout if preferred zone is not allowed David Rientjes
2011-01-18  6:04 ` KOSAKI Motohiro
2011-01-18 10:29   ` Mel Gorman
2011-01-19 12:48     ` KOSAKI Motohiro
2011-01-18 10:15 ` Mel Gorman
2011-01-18 20:24   ` David Rientjes
2011-01-18 20:42     ` Mel Gorman
2011-01-19  1:51       ` David Rientjes
2011-01-19 13:01     ` KOSAKI Motohiro
2011-01-19 18:37       ` David Rientjes
2011-01-19 12:52   ` KOSAKI Motohiro
2011-01-19  0:43 ` Minchan Kim
2011-01-19  1:53   ` David Rientjes
2011-01-19  4:10     ` Minchan Kim
2011-01-19 19:59     ` Christoph Lameter
2011-01-19 20:06       ` Andi Kleen
2011-01-19 20:18         ` David Rientjes
2011-01-19 23:07           ` Christoph Lameter
2011-01-20  0:59           ` Minchan Kim
2011-01-23 22:30 ` [patch v2] " David Rientjes
2011-01-24 17:16   ` Rik van Riel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.