* [patch 1/2] mm: page_alloc: reset aging cycle with GFP_THISNODE
From: Johannes Weiner @ 2014-02-25 20:27 UTC
  To: Andrew Morton
  Cc: Jan Stancek, Mel Gorman, Rik van Riel, linux-mm, linux-kernel

Jan Stancek reports manual page migration encountering allocation
failures after some pages when there is still plenty of memory free,
and bisected the problem down to 81c0a2bb515f ("mm: page_alloc: fair
zone allocator policy").

The problem is that page migration uses GFP_THISNODE and this makes
the page allocator bail out before entering the slowpath entirely,
without resetting the zone round-robin batches.  A string of such
allocations will fail long before the node's free memory is exhausted.
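
For reference, GFP_THISNODE is a composite flag, and the slowpath
bailout tests for all of its bits at once (definitions as used in
this series):

	/* include/linux/gfp.h */
	#define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)

	/* mm/page_alloc.c, __alloc_pages_slowpath(), see the diff below */
	if (IS_ENABLED(CONFIG_NUMA) &&
	    (gfp_mask & GFP_THISNODE) == GFP_THISNODE)
		goto nopage;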

GFP_THISNODE is a special flag for callsites that implement their own
clever node fallback and so no direct reclaim should be invoked.  But
if the allocations fail, the fair allocation batches should still be
reset, and if the node is full, it should be aged in the background.

Make GFP_THISNODE wake up kswapd and reset the zone batches, but bail
out before entering direct reclaim to not stall the allocating task.

Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org> # 3.12+
---
 mm/page_alloc.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3758a09a009..b92f66e78ec1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2493,18 +2493,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 	}
 
-	/*
-	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
-	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
-	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
-	 * using a larger set of nodes after it has established that the
-	 * allowed per node queues are empty and that nodes are
-	 * over allocated.
-	 */
-	if (IS_ENABLED(CONFIG_NUMA) &&
-			(gfp_mask & GFP_THISNODE) == GFP_THISNODE)
-		goto nopage;
-
 restart:
 	prepare_slowpath(gfp_mask, order, zonelist,
 			 high_zoneidx, preferred_zone);
@@ -2549,6 +2537,18 @@ rebalance:
 		}
 	}
 
+	/*
+	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
+	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
+	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
+	 * using a larger set of nodes after it has established that the
+	 * allowed per node queues are empty and that nodes are
+	 * over allocated.
+	 */
+	if (IS_ENABLED(CONFIG_NUMA) &&
+			(gfp_mask & GFP_THISNODE) == GFP_THISNODE)
+		goto nopage;
+
 	/* Atomic allocations - we can't balance anything */
 	if (!wait) {
 		/*
-- 
1.9.0



* [patch 2/2] mm: fix GFP_THISNODE callers and clarify
From: Johannes Weiner @ 2014-02-25 20:27 UTC
  To: Andrew Morton
  Cc: Jan Stancek, Mel Gorman, Rik van Riel, linux-mm, linux-kernel

GFP_THISNODE is for callers that implement their own clever fallback
to remote nodes, and so no direct reclaim is invoked.  There are many
current users that only want node exclusiveness but still want reclaim
to make the allocation happen.  Convert them over to __GFP_THISNODE
and update the documentation to clarify GFP_THISNODE semantics.
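
To illustrate the distinction (illustrative sketch only, not part of
the diff below):

	/* Node-local or fail, but reclaim may run to satisfy this: */
	page = alloc_pages_exact_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);

	/*
	 * Node-local or fail immediately without reclaim -- only for
	 * callers like slab that implement their own fallback to
	 * remote nodes:
	 */
	page = alloc_pages_exact_node(nid, GFP_KERNEL | GFP_THISNODE, 0);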

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 arch/ia64/kernel/uncached.c       |  2 +-
 arch/powerpc/platforms/cell/ras.c |  3 ++-
 drivers/misc/sgi-xp/xpc_uv.c      |  2 +-
 include/linux/gfp.h               |  4 ++++
 include/linux/mmzone.h            |  4 ++--
 include/linux/slab.h              |  2 +-
 kernel/profile.c                  |  4 ++--
 mm/migrate.c                      | 11 ++++++-----
 8 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/arch/ia64/kernel/uncached.c b/arch/ia64/kernel/uncached.c
index a96bcf83a735..20e8a9b21d75 100644
--- a/arch/ia64/kernel/uncached.c
+++ b/arch/ia64/kernel/uncached.c
@@ -98,7 +98,7 @@ static int uncached_add_chunk(struct uncached_pool *uc_pool, int nid)
 	/* attempt to allocate a granule's worth of cached memory pages */
 
 	page = alloc_pages_exact_node(nid,
-				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+				GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
 				IA64_GRANULE_SHIFT-PAGE_SHIFT);
 	if (!page) {
 		mutex_unlock(&uc_pool->add_chunk_mutex);
diff --git a/arch/powerpc/platforms/cell/ras.c b/arch/powerpc/platforms/cell/ras.c
index 5ec1e47a0d77..e865d748179b 100644
--- a/arch/powerpc/platforms/cell/ras.c
+++ b/arch/powerpc/platforms/cell/ras.c
@@ -123,7 +123,8 @@ static int __init cbe_ptcal_enable_on_node(int nid, int order)
 
 	area->nid = nid;
 	area->order = order;
-	area->pages = alloc_pages_exact_node(area->nid, GFP_KERNEL|GFP_THISNODE,
+	area->pages = alloc_pages_exact_node(area->nid,
+						GFP_KERNEL|__GFP_THISNODE,
 						area->order);
 
 	if (!area->pages) {
diff --git a/drivers/misc/sgi-xp/xpc_uv.c b/drivers/misc/sgi-xp/xpc_uv.c
index b9e2000969f0..95c894482fdd 100644
--- a/drivers/misc/sgi-xp/xpc_uv.c
+++ b/drivers/misc/sgi-xp/xpc_uv.c
@@ -240,7 +240,7 @@ xpc_create_gru_mq_uv(unsigned int mq_size, int cpu, char *irq_name,
 
 	nid = cpu_to_node(cpu);
 	page = alloc_pages_exact_node(nid,
-				      GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+				      GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
 				      pg_order);
 	if (page == NULL) {
 		dev_err(xpc_part, "xpc_create_gru_mq_uv() failed to alloc %d "
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0437439bc047..39b81dc7d01a 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -123,6 +123,10 @@ struct vm_area_struct;
 			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
 			 __GFP_NO_KSWAPD)
 
+/*
+ * GFP_THISNODE does not perform any reclaim, you most likely want to
+ * use __GFP_THISNODE to allocate from a given node without fallback!
+ */
 #ifdef CONFIG_NUMA
 #define GFP_THISNODE	(__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
 #else
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5f2052c83154..9b61b9bf81ac 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -590,10 +590,10 @@ static inline bool zone_is_empty(struct zone *zone)
 
 /*
  * The NUMA zonelists are doubled because we need zonelists that restrict the
- * allocations to a single node for GFP_THISNODE.
+ * allocations to a single node for __GFP_THISNODE.
  *
  * [0]	: Zonelist with fallback
- * [1]	: No fallback (GFP_THISNODE)
+ * [1]	: No fallback (__GFP_THISNODE)
  */
 #define MAX_ZONELISTS 2
 
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 9260abdd67df..b5b2df60299e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -410,7 +410,7 @@ static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
  *
  * %GFP_NOWAIT - Allocation will not sleep.
  *
- * %GFP_THISNODE - Allocate node-local memory only.
+ * %__GFP_THISNODE - Allocate node-local memory only.
  *
  * %GFP_DMA - Allocation suitable for DMA.
  *   Should only be used for kmalloc() caches. Otherwise, use a
diff --git a/kernel/profile.c b/kernel/profile.c
index 6631e1ef55ab..ebdd9c1a86b4 100644
--- a/kernel/profile.c
+++ b/kernel/profile.c
@@ -549,14 +549,14 @@ static int create_hash_tables(void)
 		struct page *page;
 
 		page = alloc_pages_exact_node(node,
-				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+				GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
 				0);
 		if (!page)
 			goto out_cleanup;
 		per_cpu(cpu_profile_hits, cpu)[1]
 				= (struct profile_hit *)page_address(page);
 		page = alloc_pages_exact_node(node,
-				GFP_KERNEL | __GFP_ZERO | GFP_THISNODE,
+				GFP_KERNEL | __GFP_ZERO | __GFP_THISNODE,
 				0);
 		if (!page)
 			goto out_cleanup;
diff --git a/mm/migrate.c b/mm/migrate.c
index 482a33d89134..b494fdb9a636 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1158,7 +1158,7 @@ static struct page *new_page_node(struct page *p, unsigned long private,
 					pm->node);
 	else
 		return alloc_pages_exact_node(pm->node,
-				GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);
+				GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
 }
 
 /*
@@ -1544,9 +1544,9 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 	struct page *newpage;
 
 	newpage = alloc_pages_exact_node(nid,
-					 (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
-					  __GFP_NOMEMALLOC | __GFP_NORETRY |
-					  __GFP_NOWARN) &
+					 (GFP_HIGHUSER_MOVABLE |
+					  __GFP_THISNODE | __GFP_NOMEMALLOC |
+					  __GFP_NORETRY | __GFP_NOWARN) &
 					 ~GFP_IOFS, 0);
 
 	return newpage;
@@ -1747,7 +1747,8 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_dropref;
 
 	new_page = alloc_pages_node(node,
-		(GFP_TRANSHUGE | GFP_THISNODE) & ~__GFP_WAIT, HPAGE_PMD_ORDER);
+		(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_WAIT,
+		HPAGE_PMD_ORDER);
 	if (!new_page)
 		goto out_fail;
 
-- 
1.9.0



* Re: [patch 1/2] mm: page_alloc: reset aging cycle with GFP_THISNODE
From: Rik van Riel @ 2014-02-25 20:36 UTC
  To: Johannes Weiner, Andrew Morton
  Cc: Jan Stancek, Mel Gorman, linux-mm, linux-kernel

On 02/25/2014 03:27 PM, Johannes Weiner wrote:
> Jan Stancek reports manual page migration encountering allocation
> failures after some pages when there is still plenty of memory free,
> and bisected the problem down to 81c0a2bb515f ("mm: page_alloc: fair
> zone allocator policy").
> 
> The problem is that page migration uses GFP_THISNODE and this makes
> the page allocator bail out before entering the slowpath entirely,
> without resetting the zone round-robin batches.  A string of such
> allocations will fail long before the node's free memory is exhausted.
> 
> GFP_THISNODE is a special flag for callsites that implement their own
> clever node fallback and so no direct reclaim should be invoked.  But
> if the allocations fail, the fair allocation batches should still be
> reset, and if the node is full, it should be aged in the background.
> 
> Make GFP_THISNODE wake up kswapd and reset the zone batches, but bail
> out before entering direct reclaim to not stall the allocating task.
> 
> Reported-by: Jan Stancek <jstancek@redhat.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: <stable@kernel.org> # 3.12+

Acked-by: Rik van Riel <riel@redhat.com>

-- 
All rights reversed


* Re: [patch 2/2] mm: fix GFP_THISNODE callers and clarify
From: Rik van Riel @ 2014-02-25 20:37 UTC
  To: Johannes Weiner, Andrew Morton
  Cc: Jan Stancek, Mel Gorman, linux-mm, linux-kernel

On 02/25/2014 03:27 PM, Johannes Weiner wrote:
> GFP_THISNODE is for callers that implement their own clever fallback
> to remote nodes, and so no direct reclaim is invoked.  There are many
> current users that only want node exclusiveness but still want reclaim
> to make the allocation happen.  Convert them over to __GFP_THISNODE
> and update the documentation to clarify GFP_THISNODE semantics.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Acked-by: Rik van Riel <riel@redhat.com>


-- 
All rights reversed


* Re: [patch 1/2] mm: page_alloc: reset aging cycle with GFP_THISNODE
From: Mel Gorman @ 2014-02-26  9:54 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Jan Stancek, Rik van Riel, linux-mm, linux-kernel

On Tue, Feb 25, 2014 at 03:27:01PM -0500, Johannes Weiner wrote:
> Jan Stancek reports manual page migration encountering allocation
> failures after some pages when there is still plenty of memory free,
> and bisected the problem down to 81c0a2bb515f ("mm: page_alloc: fair
> zone allocator policy").
> 
> The problem is that page migration uses GFP_THISNODE and this makes
> the page allocator bail out before entering the slowpath entirely,
> without resetting the zone round-robin batches.  A string of such
> allocations will fail long before the node's free memory is exhausted.
> 
> GFP_THISNODE is a special flag for callsites that implement their own
> clever node fallback and so no direct reclaim should be invoked.  But
> if the allocations fail, the fair allocation batches should still be
> reset, and if the node is full, it should be aged in the background.
> 
> Make GFP_THISNODE wake up kswapd and reset the zone batches, but bail
> out before entering direct reclaim to not stall the allocating task.
> 
> Reported-by: Jan Stancek <jstancek@redhat.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: <stable@kernel.org> # 3.12+
> ---
>  mm/page_alloc.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e3758a09a009..b92f66e78ec1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2493,18 +2493,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		return NULL;
>  	}
>  
> -	/*
> -	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
> -	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
> -	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
> -	 * using a larger set of nodes after it has established that the
> -	 * allowed per node queues are empty and that nodes are
> -	 * over allocated.
> -	 */

By moving this past prepare_slowpath, the comment is no longer accurate.
It says it "should not cause reclaim" but a consequence of this patch is
that we wake kswapd if the allocation failed due to memory exhaustion and
attempt an allocation at a different watermark.  Your changelog calls
out the kswapd part, but it's actually a pretty significant change to
make as part of this bug fix: kswapd potentially reclaims within a node
when the caller may have been happy to retry on remote nodes without
reclaiming.

The bug report states that "manual page migration encountering allocation
failures after some pages when there is still plenty of memory free". Plenty
of memory was free, yet with this patch applied we will attempt to wake
kswapd. Granted, the zone_balanced() check should prevent kswapd from
actually being woken up, but it's wasteful.

How about special-casing the (alloc_flags & ALLOC_WMARK_LOW) check in
get_page_from_freelist to also ignore GFP_THISNODE? The NR_ALLOC_BATCH
will go further negative if there are storms of GFP_THISNODE
allocations, forcing other allocations into the slow path and repeated
calls to prepare_slowpath, but it would be closer to current behaviour
and avoid weirdness with kswapd.
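
Roughly something like this (untested sketch, keeping the existing
batch and locality checks):

	if ((alloc_flags & ALLOC_WMARK_LOW) &&
	    (gfp_mask & GFP_THISNODE) != GFP_THISNODE) {
		/* assumes CONFIG_NUMA, where GFP_THISNODE is non-zero */
		if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
			continue;
		if (!zone_local(preferred_zone, zone))
			continue;
	}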

-- 
Mel Gorman
SUSE Labs


* Re: [patch 1/2] mm: page_alloc: reset aging cycle with GFP_THISNODE
From: Johannes Weiner @ 2014-02-26 17:12 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Jan Stancek, Rik van Riel, linux-mm, linux-kernel

On Wed, Feb 26, 2014 at 09:54:22AM +0000, Mel Gorman wrote:
> On Tue, Feb 25, 2014 at 03:27:01PM -0500, Johannes Weiner wrote:
> > Jan Stancek reports manual page migration encountering allocation
> > failures after some pages when there is still plenty of memory free,
> > and bisected the problem down to 81c0a2bb515f ("mm: page_alloc: fair
> > zone allocator policy").
> > 
> > The problem is that page migration uses GFP_THISNODE and this makes
> > the page allocator bail out before entering the slowpath entirely,
> > without resetting the zone round-robin batches.  A string of such
> > allocations will fail long before the node's free memory is exhausted.
> > 
> > GFP_THISNODE is a special flag for callsites that implement their own
> > clever node fallback and so no direct reclaim should be invoked.  But
> > if the allocations fail, the fair allocation batches should still be
> > reset, and if the node is full, it should be aged in the background.
> > 
> > Make GFP_THISNODE wake up kswapd and reset the zone batches, but bail
> > out before entering direct reclaim to not stall the allocating task.
> > 
> > Reported-by: Jan Stancek <jstancek@redhat.com>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: <stable@kernel.org> # 3.12+
> > ---
> >  mm/page_alloc.c | 24 ++++++++++++------------
> >  1 file changed, 12 insertions(+), 12 deletions(-)
> > 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index e3758a09a009..b92f66e78ec1 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2493,18 +2493,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >  		return NULL;
> >  	}
> >  
> > -	/*
> > -	 * GFP_THISNODE (meaning __GFP_THISNODE, __GFP_NORETRY and
> > -	 * __GFP_NOWARN set) should not cause reclaim since the subsystem
> > -	 * (f.e. slab) using GFP_THISNODE may choose to trigger reclaim
> > -	 * using a larger set of nodes after it has established that the
> > -	 * allowed per node queues are empty and that nodes are
> > -	 * over allocated.
> > -	 */
> 
> By moving this past prepare_slowpath, the comment is no longer accurate.
> It says it "should not cause reclaim" but a consequence of this patch is
> that we wake kswapd if the allocation failed due to memory exhaustion and
> attempt an allocation at a different watermark.  Your changelog calls
> out the kswapd part, but it's actually a pretty significant change to
> make as part of this bug fix: kswapd potentially reclaims within a node
> when the caller may have been happy to retry on remote nodes without
> reclaiming.
> 
> The bug report states that "manual page migration encountering allocation
> failures after some pages when there is still plenty of memory free". Plenty
> of memory was free, yet with this patch applied we will attempt to wake
> kswapd. Granted, the zone_balanced() check should prevent kswapd from
> actually being woken up, but it's wasteful.

Yes, slab has its own fallback implementation and is willing to
sacrifice some locality for allocation latency, but once the fallbacks
are exhausted it will also have to enter reclaim.  By waking kswapd in
this case, future fallbacks can be prevented and allocation latency
reduced.  Not just for slab allocations, but for all order-0 requests.
And at near-nil cost to the allocating task.

Most other allocations will wake kswapd anyway; this report came from
a testcase that didn't run anything else on the machine.  The current
behavior seems more like an implementation glitch in this special case
than intentional design.

> How about special-casing the (alloc_flags & ALLOC_WMARK_LOW) check in
> get_page_from_freelist to also ignore GFP_THISNODE? The NR_ALLOC_BATCH
> will go further negative if there are storms of GFP_THISNODE
> allocations, forcing other allocations into the slow path and repeated
> calls to prepare_slowpath, but it would be closer to current behaviour
> and avoid weirdness with kswapd.

I think the result would be much uglier.  The allocations wouldn't
participate in the fairness protocol, and they'd create work for
kswapd without waking it up, diminishing the latency reduction for
which we have kswapd in the first place.

If kswapd wakeups turn out to be too aggressive, I'd rather we
ratelimit them in some way than exempt random order-0 allocation types
as a moderation measure.  Exempting higher-order wakeups, as THP does,
is one thing, but we want order-0 watermarks to be met at all times
anyway, so it would make sense to me to nudge kswapd for every failing
order-0 request.


* Re: [patch 1/2] mm: page_alloc: reset aging cycle with GFP_THISNODE
From: Johannes Weiner @ 2014-02-26 20:13 UTC
  To: Mel Gorman
  Cc: Andrew Morton, Jan Stancek, Rik van Riel, linux-mm, linux-kernel

On Wed, Feb 26, 2014 at 12:12:06PM -0500, Johannes Weiner wrote:
> On Wed, Feb 26, 2014 at 09:54:22AM +0000, Mel Gorman wrote:
> > How about special-casing the (alloc_flags & ALLOC_WMARK_LOW) check in
> > get_page_from_freelist to also ignore GFP_THISNODE? The NR_ALLOC_BATCH
> > will go further negative if there are storms of GFP_THISNODE
> > allocations, forcing other allocations into the slow path and repeated
> > calls to prepare_slowpath, but it would be closer to current behaviour
> > and avoid weirdness with kswapd.
> 
> I think the result would be much uglier.  The allocations wouldn't
> participate in the fairness protocol, and they'd create work for
> kswapd without waking it up, diminishing the latency reduction for
> which we have kswapd in the first place.
> 
> If kswapd wakeups turn out to be too aggressive, I'd rather we
> ratelimit them in some way than exempt random order-0 allocation types
> as a moderation measure.  Exempting higher-order wakeups, as THP does,
> is one thing, but we want order-0 watermarks to be met at all times
> anyway, so it would make sense to me to nudge kswapd for every failing
> order-0 request.

So I'd still like to fix this and wake kswapd even for GFP_THISNODE
allocations, but let's defer it for now in favor of a minimal bugfix
that can be ported to -stable.

Would this be an acceptable replacement for 1/2?

---

From: Johannes Weiner <hannes@cmpxchg.org>
Subject: [patch 1/2] mm: page_alloc: exempt GFP_THISNODE allocations from zone
 fairness

Jan Stancek reports manual page migration encountering allocation
failures after some pages when there is still plenty of memory free,
and bisected the problem down to 81c0a2bb515f ("mm: page_alloc: fair
zone allocator policy").

The problem is that GFP_THISNODE obeys the zone fairness allocation
batches on one hand, but doesn't reset them and wake kswapd on the
other hand.  After a few of those allocations, the batches are
exhausted and the allocations fail.

Fixing this means either having GFP_THISNODE wake up kswapd, or
GFP_THISNODE not participating in zone fairness at all.  The latter
seems safer as an acute bugfix; we can clean up later.

Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@kernel.org> # 3.12+
---
 mm/page_alloc.c | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e3758a09a009..14372bec0e81 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1236,6 +1236,15 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
 	}
 	local_irq_restore(flags);
 }
+static bool gfp_thisnode_allocation(gfp_t gfp_mask)
+{
+	return (gfp_mask & GFP_THISNODE) == GFP_THISNODE;
+}
+#else
+static bool gfp_thisnode_allocation(gfp_t gfp_mask)
+{
+	return false;
+}
 #endif
 
 /*
@@ -1572,7 +1581,13 @@ again:
 					  get_pageblock_migratetype(page));
 	}
 
-	__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+	/*
+	 * NOTE: GFP_THISNODE allocations do not partake in the kswapd
+	 * aging protocol, so they can't be fair.
+	 */
+	if (!gfp_thisnode_allocation(gfp_flags))
+		__mod_zone_page_state(zone, NR_ALLOC_BATCH, -(1 << order));
+
 	__count_zone_vm_events(PGALLOC, zone, 1 << order);
 	zone_statistics(preferred_zone, zone, gfp_flags);
 	local_irq_restore(flags);
@@ -1944,8 +1959,12 @@ zonelist_scan:
 		 * ultimately fall back to remote zones that do not
 		 * partake in the fairness round-robin cycle of this
 		 * zonelist.
+		 *
+		 * NOTE: GFP_THISNODE allocations do not partake in
+		 * the kswapd aging protocol, so they can't be fair.
 		 */
-		if (alloc_flags & ALLOC_WMARK_LOW) {
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    !gfp_thisnode_allocation(gfp_mask)) {
 			if (zone_page_state(zone, NR_ALLOC_BATCH) <= 0)
 				continue;
 			if (!zone_local(preferred_zone, zone))
@@ -2501,8 +2520,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 * allowed per node queues are empty and that nodes are
 	 * over allocated.
 	 */
-	if (IS_ENABLED(CONFIG_NUMA) &&
-			(gfp_mask & GFP_THISNODE) == GFP_THISNODE)
+	if (gfp_thisnode_allocation(gfp_mask))
 		goto nopage;
 
 restart:
-- 
1.9.0



* Re: [patch 1/2] mm: page_alloc: reset aging cycle with GFP_THISNODE
From: Rik van Riel @ 2014-02-27 20:23 UTC
  To: Johannes Weiner, Mel Gorman
  Cc: Andrew Morton, Jan Stancek, linux-mm, linux-kernel

On 02/26/2014 03:13 PM, Johannes Weiner wrote:

> Would this be an acceptable replacement for 1/2?

Looks reasonable to me. This should avoid the issues that
were observed with NUMA migrations.

> ---
>
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch 1/2] mm: page_alloc: exempt GFP_THISNODE allocations from zone
>   fairness
>
> Jan Stancek reports manual page migration encountering allocation
> failures after some pages when there is still plenty of memory free,
> and bisected the problem down to 81c0a2bb515f ("mm: page_alloc: fair
> zone allocator policy").
>
> The problem is that GFP_THISNODE obeys the zone fairness allocation
> batches on one hand, but doesn't reset them and wake kswapd on the
> other hand.  After a few of those allocations, the batches are
> exhausted and the allocations fail.
>
> Fixing this means either having GFP_THISNODE wake up kswapd, or
> GFP_THISNODE not participating in zone fairness at all.  The latter
> seems safer as an acute bugfix; we can clean up later.
>
> Reported-by: Jan Stancek <jstancek@redhat.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: <stable@kernel.org> # 3.12+

Acked-by: Rik van Riel <riel@redhat.com>



* Re: [patch 1/2] mm: page_alloc: reset aging cycle with GFP_THISNODE
From: Mel Gorman @ 2014-02-28 11:45 UTC
  To: Johannes Weiner
  Cc: Andrew Morton, Jan Stancek, Rik van Riel, linux-mm, linux-kernel

On Wed, Feb 26, 2014 at 03:13:33PM -0500, Johannes Weiner wrote:
> On Wed, Feb 26, 2014 at 12:12:06PM -0500, Johannes Weiner wrote:
> > On Wed, Feb 26, 2014 at 09:54:22AM +0000, Mel Gorman wrote:
> > > How about special-casing the (alloc_flags & ALLOC_WMARK_LOW) check in
> > > get_page_from_freelist to also ignore GFP_THISNODE? The NR_ALLOC_BATCH
> > > will go further negative if there are storms of GFP_THISNODE
> > > allocations, forcing other allocations into the slow path and repeated
> > > calls to prepare_slowpath, but it would be closer to current behaviour
> > > and avoid weirdness with kswapd.
> > 
> > I think the result would be much uglier.  The allocations wouldn't
> > participate in the fairness protocol, and they'd create work for
> > kswapd without waking it up, diminishing the latency reduction for
> > which we have kswapd in the first place.
> > 
> > If kswapd wakeups turn out to be too aggressive, I'd rather we
> > ratelimit them in some way than exempt random order-0 allocation types
> > as a moderation measure.  Exempting higher-order wakeups, as THP does,
> > is one thing, but we want order-0 watermarks to be met at all times
> > anyway, so it would make sense to me to nudge kswapd for every failing
> > order-0 request.
> 
> So I'd still like to fix this and wake kswapd even for GFP_THISNODE
> allocations, but let's defer it for now in favor of a minimal bugfix
> that can be ported to -stable.
> 
> Would this be an acceptable replacement for 1/2?
> 
> ---
> 
> From: Johannes Weiner <hannes@cmpxchg.org>
> Subject: [patch 1/2] mm: page_alloc: exempt GFP_THISNODE allocations from zone
>  fairness
> 
> Jan Stancek reports manual page migration encountering allocation
> failures after some pages when there is still plenty of memory free,
> and bisected the problem down to 81c0a2bb515f ("mm: page_alloc: fair
> zone allocator policy").
> 
> The problem is that GFP_THISNODE obeys the zone fairness allocation
> batches on one hand, but doesn't reset them and wake kswapd on the
> other hand.  After a few of those allocations, the batches are
> exhausted and the allocations fail.
> 
> Fixing this means either having GFP_THISNODE wake up kswapd, or
> GFP_THISNODE not participating in zone fairness at all.  The latter
> seems safer as an acute bugfix; we can clean up later.
> 
> Reported-by: Jan Stancek <jstancek@redhat.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: <stable@kernel.org> # 3.12+

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs

