linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
@ 2011-05-11 15:29 Mel Gorman
  2011-05-11 15:29 ` [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative " Mel Gorman
                   ` (3 more replies)
  0 siblings, 4 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 15:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, Mel Gorman

Debian (and probably Ubuntu) has recently changed its default slab
allocator to SLUB. There are a few reports of people experiencing
hangs, with kswapd using a large amount of CPU, when copying large
amounts of data. It appears this is down to SLUB using high orders by
default and the page allocator and reclaim struggling to keep up. The
following three patches reduce the cost of using those high orders.

Patch 1 prevents kswapd waking up in response to SLUB's speculative
	use of high orders. This eliminates the hangs and while the
	system can still stall for long periods, it recovers.

Patch 2 further reduces the cost by preventing SLUB from entering
	direct compaction or reclaim paths on the grounds that falling
	back to order-0 should be cheaper.

Patch 3 defaults SLUB to using order-0 on the grounds that workloads
	that benefit heavily from high orders are also sized to fit
	in physical memory. On such systems, administrators should
	manually tune slub_max_order=3.

My own data on this is not great. I haven't really been able to
reproduce the same problem locally. A significant failing is that
the tests weren't stressing X, and I couldn't make meaningful
comparisons by just randomly clicking on things (I'm working on
fixing this).

The test case is simple. "download tar" wgets a large tar file and
stores it locally. "unpack" is expanding it (15 times physical RAM
in this case) and "delete source dirs" is the tarfile being deleted
again. I also experimented with having the tar copied numerous times
and into deeper directories to increase the size but the results were
not particularly interesting so I left it as one tar.

Test server, 4 CPU threads (AMD Phenom), x86_64, 2G of RAM, no X running
                                          nowake-
             largecopy-vanilla       kswapd-v1r1  noexstep-v1r1     default0-v1r1
download tar           94 ( 0.00%)   94 ( 0.00%)   94 ( 0.00%)   93 ( 1.08%)
unpack tar            521 ( 0.00%)  551 (-5.44%)  482 ( 8.09%)  488 ( 6.76%)
delete source dirs    208 ( 0.00%)  218 (-4.59%)  194 ( 7.22%)  194 ( 7.22%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)        740.82    777.73    739.98    747.47
Total Elapsed Time (seconds)               1046.66   1273.91    962.47    936.17

Disabling the kswapd wakeup alone hurts performance slightly even
though testers report it fixes the hangs. I would guess it's because
SLUB callers are entering direct reclaim more frequently (I belatedly
noticed that compaction was disabled, so it's not a factor) but I
haven't confirmed it. However, preventing both the kswapd wakeup and
direct reclaim, with SLUB falling back to order-0, performed
noticeably faster. Just using order-0 in the first place was fastest
of all.

I tried running the same test on a test laptop but unfortunately the
results were lost due to a misconfiguration. It would take a few
hours to rerun, so I am posting without them.

If the testers verify that this series helps and we agree the patches
are appropriate, they should be considered stable candidates for
2.6.38.

 Documentation/vm/slub.txt |    2 +-
 mm/page_alloc.c           |    3 ++-
 mm/slub.c                 |    5 +++--
 3 files changed, 6 insertions(+), 4 deletions(-)

-- 
1.7.3.4


^ permalink raw reply	[flat|nested] 77+ messages in thread

* [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative high-order allocations
  2011-05-11 15:29 [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations Mel Gorman
@ 2011-05-11 15:29 ` Mel Gorman
  2011-05-11 20:38   ` David Rientjes
  2011-05-11 15:29 ` [PATCH 2/3] mm: slub: Do not take expensive steps " Mel Gorman
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 15:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, Mel Gorman

To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations and falls back to lower allocations if they
fail.  However, by simply trying to allocate, kswapd is woken up to
start reclaiming at that order. On a desktop system, two users report
that the system is getting locked up with kswapd using large amounts
of CPU.  Using SLAB instead of SLUB made this problem go away.

This patch prevents kswapd being woken up for high-order allocations.
Testing indicated that with this patch applied, the system was much
harder to hang and even when it did, it eventually recovered.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/slub.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9d2e5e4..98c358d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * Let the initial higher-order allocation fail under memory pressure
 	 * so we fall-back to the minimum order allocation.
 	 */
-	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
+	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
 
 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 2/3] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-11 15:29 [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations Mel Gorman
  2011-05-11 15:29 ` [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative " Mel Gorman
@ 2011-05-11 15:29 ` Mel Gorman
  2011-05-11 20:38   ` David Rientjes
  2011-05-11 15:29 ` [PATCH 3/3] mm: slub: Default slub_max_order to 0 Mel Gorman
  2011-05-11 21:39 ` [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations James Bottomley
  3 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 15:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, Mel Gorman

To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations and falls back to lower allocations if they
fail. However, by simply trying to allocate, the caller can enter
compaction or reclaim - both of which are likely to cost more than the
benefit of using high-order pages in SLUB. On a desktop system, two
users report that the system is getting stalled with kswapd using large
amounts of CPU.

This patch prevents SLUB from taking any expensive steps when trying
to use high-order allocations. Instead, it is expected to fall back to
smaller orders more aggressively. Testing from users was somewhat
inconclusive on how much this helped but local tests showed it made
a positive difference. It makes sense that falling back to order-0
allocations is faster than entering compaction or direct reclaim.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/page_alloc.c |    3 ++-
 mm/slub.c       |    3 ++-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9f8a97b..057f1e2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 {
 	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
 	const gfp_t wait = gfp_mask & __GFP_WAIT;
+	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
 	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
@@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	 */
 	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
 
-	if (!wait) {
+	if (!wait && can_wake_kswapd) {
 		/*
 		 * Not worth trying to allocate harder for
 		 * __GFP_NOMEMALLOC even if it can't schedule.
diff --git a/mm/slub.c b/mm/slub.c
index 98c358d..1071723 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 * Let the initial higher-order allocation fail under memory pressure
 	 * so we fall-back to the minimum order allocation.
 	 */
-	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
+	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) &
+			~(__GFP_NOFAIL | __GFP_WAIT);
 
 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-11 15:29 [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations Mel Gorman
  2011-05-11 15:29 ` [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative " Mel Gorman
  2011-05-11 15:29 ` [PATCH 2/3] mm: slub: Do not take expensive steps " Mel Gorman
@ 2011-05-11 15:29 ` Mel Gorman
  2011-05-11 20:38   ` David Rientjes
  2011-05-12 14:43   ` Christoph Lameter
  2011-05-11 21:39 ` [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations James Bottomley
  3 siblings, 2 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 15:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: James Bottomley, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4, Mel Gorman

To avoid locking and per-cpu overhead, SLUB optimistically uses
high-order allocations up to order-3 by default and falls back to
lower allocations if they fail. While care is taken that the caller
and kswapd take no unusual steps in response to this, there are
further consequences, such as shrinkers having to free more objects
to release any memory. There is anecdotal evidence that significant
time is being spent looping in shrinkers with insufficient progress
being made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd
awake.

SLUB is now the default allocator and some bug reports have been
pinned down to SLUB using high orders during operations like copying
large amounts of data. SLUB's use of high orders benefits
applications that are sized to memory appropriately, but this does
not necessarily apply to large file servers or desktops. This patch
causes SLUB to use order-0 pages by default, as SLAB does.
There is further evidence that this keeps kswapd's CPU usage lower
(https://lkml.org/lkml/2011/5/10/383).

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 Documentation/vm/slub.txt |    2 +-
 mm/slub.c                 |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
index 07375e7..778e9fa 100644
--- a/Documentation/vm/slub.txt
+++ b/Documentation/vm/slub.txt
@@ -117,7 +117,7 @@ can be influenced by kernel parameters:
 
 slub_min_objects=x		(default 4)
 slub_min_order=x		(default 0)
-slub_max_order=x		(default 1)
+slub_max_order=x		(default 0)
 
 slub_min_objects allows to specify how many objects must at least fit
 into one slab in order for the allocation order to be acceptable.
diff --git a/mm/slub.c b/mm/slub.c
index 1071723..23a4789 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
  * take the list_lock.
  */
 static int slub_min_order;
-static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
+static int slub_max_order;
 static int slub_min_objects;
 
 /*
-- 
1.7.3.4


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative high-order allocations
  2011-05-11 15:29 ` [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative " Mel Gorman
@ 2011-05-11 20:38   ` David Rientjes
  0 siblings, 0 replies; 77+ messages in thread
From: David Rientjes @ 2011-05-11 20:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, 11 May 2011, Mel Gorman wrote:

> To avoid locking and per-cpu overhead, SLUB optimistically uses
> high-order allocations and falls back to lower allocations if they
> fail.  However, by simply trying to allocate, kswapd is woken up to
> start reclaiming at that order. On a desktop system, two users report
> that the system is getting locked up with kswapd using large amounts
> of CPU.  Using SLAB instead of SLUB made this problem go away.
> 
> This patch prevents kswapd being woken up for high-order allocations.
> Testing indicated that with this patch applied, the system was much
> harder to hang and even when it did, it eventually recovered.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>

Acked-by: David Rientjes <rientjes@google.com>

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/3] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-11 15:29 ` [PATCH 2/3] mm: slub: Do not take expensive steps " Mel Gorman
@ 2011-05-11 20:38   ` David Rientjes
  2011-05-11 21:10     ` Mel Gorman
  0 siblings, 1 reply; 77+ messages in thread
From: David Rientjes @ 2011-05-11 20:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, 11 May 2011, Mel Gorman wrote:

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9f8a97b..057f1e2 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  {
>  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
>  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
>  
>  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
>  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	 */
>  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
>  
> -	if (!wait) {
> +	if (!wait && can_wake_kswapd) {
>  		/*
>  		 * Not worth trying to allocate harder for
>  		 * __GFP_NOMEMALLOC even if it can't schedule.
> diff --git a/mm/slub.c b/mm/slub.c
> index 98c358d..1071723 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
>  	 * Let the initial higher-order allocation fail under memory pressure
>  	 * so we fall-back to the minimum order allocation.
>  	 */
> -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) &
> +			~(__GFP_NOFAIL | __GFP_WAIT);

__GFP_NORETRY is a no-op without __GFP_WAIT.

>  
>  	page = alloc_slab_page(alloc_gfp, node, oo);
>  	if (unlikely(!page)) {

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-11 15:29 ` [PATCH 3/3] mm: slub: Default slub_max_order to 0 Mel Gorman
@ 2011-05-11 20:38   ` David Rientjes
  2011-05-11 20:53     ` James Bottomley
                       ` (2 more replies)
  2011-05-12 14:43   ` Christoph Lameter
  1 sibling, 3 replies; 77+ messages in thread
From: David Rientjes @ 2011-05-11 20:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, 11 May 2011, Mel Gorman wrote:

> To avoid locking and per-cpu overhead, SLUB optimistically uses
> high-order allocations up to order-3 by default and falls back to
> lower allocations if they fail. While care is taken that the caller
> and kswapd take no unusual steps in response to this, there are
> further consequences like shrinkers who have to free more objects to
> release any memory. There is anecdotal evidence that significant time
> is being spent looping in shrinkers with insufficient progress being
> made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.
> 
> SLUB is now the default allocator and some bug reports have been
> pinned down to SLUB using high orders during operations like
> copying large amounts of data. SLUBs use of high-orders benefits
> applications that are sized to memory appropriately but this does not
> necessarily apply to large file servers or desktops.  This patch
> causes SLUB to use order-0 pages like SLAB does by default.
> There is further evidence that this keeps kswapd's usage lower
> (https://lkml.org/lkml/2011/5/10/383).
> 

This is going to severely impact slub's performance for applications on 
machines with plenty of memory available, where fragmentation isn't a 
concern when allocating from caches with large object sizes (even 
changing the min order of kmalloc-256 from 1 to 0!), by default for users 
who don't use slub_max_order=3 on the command line.  SLUB relies heavily 
on allocating from the cpu slab and freeing to the cpu slab to avoid the 
slowpaths, so higher order slabs are important for its performance.

I can get numbers for a simple netperf TCP_RR benchmark with this change 
applied to show the degradation on a server with >32GB of RAM with this 
patch applied.

It would be ideal if this default could be adjusted based on the amount of 
memory available in the smallest node to determine whether we're concerned 
about making higher order allocations.  (Using the smallest node as a 
metric so that mempolicies and cpusets don't get unfairly biased against.)  
With the previous changes in this patchset, specifically avoiding waking 
kswapd and doing compaction for the higher order allocs before falling 
back to the min order, it shouldn't be devastating to try an order-3 alloc 
that will fail quickly.

> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  Documentation/vm/slub.txt |    2 +-
>  mm/slub.c                 |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
> index 07375e7..778e9fa 100644
> --- a/Documentation/vm/slub.txt
> +++ b/Documentation/vm/slub.txt
> @@ -117,7 +117,7 @@ can be influenced by kernel parameters:
>  
>  slub_min_objects=x		(default 4)
>  slub_min_order=x		(default 0)
> -slub_max_order=x		(default 1)
> +slub_max_order=x		(default 0)

Hmm, that was wrong to begin with, it should have been 3.

>  
>  slub_min_objects allows to specify how many objects must at least fit
>  into one slab in order for the allocation order to be acceptable.
> diff --git a/mm/slub.c b/mm/slub.c
> index 1071723..23a4789 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
>   * take the list_lock.
>   */
>  static int slub_min_order;
> -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> +static int slub_max_order;
>  static int slub_min_objects;
>  
>  /*

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-11 20:38   ` David Rientjes
@ 2011-05-11 20:53     ` James Bottomley
  2011-05-11 21:09     ` Mel Gorman
  2011-05-12 17:36     ` Andrea Arcangeli
  2 siblings, 0 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-11 20:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, 2011-05-11 at 13:38 -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
> 
> > To avoid locking and per-cpu overhead, SLUB optimistically uses
> > high-order allocations up to order-3 by default and falls back to
> > lower allocations if they fail. While care is taken that the caller
> > and kswapd take no unusual steps in response to this, there are
> > further consequences like shrinkers who have to free more objects to
> > release any memory. There is anecdotal evidence that significant time
> > is being spent looping in shrinkers with insufficient progress being
> > made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.
> > 
> > SLUB is now the default allocator and some bug reports have been
> > pinned down to SLUB using high orders during operations like
> > copying large amounts of data. SLUBs use of high-orders benefits
> > applications that are sized to memory appropriately but this does not
> > necessarily apply to large file servers or desktops.  This patch
> > causes SLUB to use order-0 pages like SLAB does by default.
> > There is further evidence that this keeps kswapd's usage lower
> > (https://lkml.org/lkml/2011/5/10/383).
> > 
> 
> This is going to severely impact slub's performance for applications on 
> machines with plenty of memory available where fragmentation isn't a 
> concern when allocating from caches with large object sizes (even 
> changing the min order of kmalloc-256 from 1 to 0!) by default for users 
> who don't use slub_max_order=3 on the command line.  SLUB relies heavily 
> on allocating from the cpu slab and freeing to the cpu slab to avoid the 
> slowpaths, so higher order slabs are important for its performance.
> 
> I can get numbers for a simple netperf TCP_RR benchmark with this change 
> applied to show the degradation on a server with >32GB of RAM with this 
> patch applied.
> 
> It would be ideal if this default could be adjusted based on the amount of 
> memory available in the smallest node to determine whether we're concerned 
> about making higher order allocations.  (Using the smallest node as a 
> metric so that mempolicies and cpusets don't get unfairly biased against.)  
> With the previous changes in this patchset, specifically avoiding waking 
> kswapd and doing compaction for the higher order allocs before falling 
> back to the min order, it shouldn't be devastating to try an order-3 alloc 
> that will fail quickly.

So my testing has shown that simply booting the kernel with
slub_max_order=0 makes the hang I'm seeing go away.  This definitely
implicates the higher order allocations in the kswapd problem.  I think
it would be wise not to make it the default until we can sort out the
root cause.

James

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-11 20:38   ` David Rientjes
  2011-05-11 20:53     ` James Bottomley
@ 2011-05-11 21:09     ` Mel Gorman
  2011-05-11 22:27       ` David Rientjes
  2011-05-12 17:36     ` Andrea Arcangeli
  2 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 21:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
> 
> > To avoid locking and per-cpu overhead, SLUB optimistically uses
> > high-order allocations up to order-3 by default and falls back to
> > lower allocations if they fail. While care is taken that the caller
> > and kswapd take no unusual steps in response to this, there are
> > further consequences like shrinkers who have to free more objects to
> > release any memory. There is anecdotal evidence that significant time
> > is being spent looping in shrinkers with insufficient progress being
> > made (https://lkml.org/lkml/2011/4/28/361) and keeping kswapd awake.
> > 
> > SLUB is now the default allocator and some bug reports have been
> > pinned down to SLUB using high orders during operations like
> > copying large amounts of data. SLUBs use of high-orders benefits
> > applications that are sized to memory appropriately but this does not
> > necessarily apply to large file servers or desktops.  This patch
> > causes SLUB to use order-0 pages like SLAB does by default.
> > There is further evidence that this keeps kswapd's usage lower
> > (https://lkml.org/lkml/2011/5/10/383).
> > 
> 
> This is going to severely impact slub's performance for applications on 
> machines with plenty of memory available where fragmentation isn't a 
> concern when allocating from caches with large object sizes (even 
> changing the min order of kmalloc-256 from 1 to 0!) by default for users 
> who don't use slub_max_order=3 on the command line.  SLUB relies heavily 
> on allocating from the cpu slab and freeing to the cpu slab to avoid the 
> slowpaths, so higher order slabs are important for its performance.
> 

I agree with you that there are situations where plenty of memory
means it'll perform much better. However, indications are that it
breaks down with high CPU usage when memory is low. Worse, once
fragmentation becomes a problem, large amounts of UNMOVABLE and
RECLAIMABLE pages make it progressively more expensive to find the
necessary pages. Perhaps with patches 1 and 2 this is not as much of
a problem, but the figures in the leader indicated that for a simple
workload with large amounts of files and data exceeding physical
memory, it was better not to use high orders at all. I'd expect that
situation to be encountered by more users than performance-sensitive
applications.

In other words, we're taking one hit or the other.

> I can get numbers for a simple netperf TCP_RR benchmark with this change 
> applied to show the degradation on a server with >32GB of RAM with this 
> patch applied.
> 

Agreed, I'd expect netperf TCP_RR or TCP_STREAM to take a hit,
particularly on a local machine where the recycling of pages will
impact it heavily.

> It would be ideal if this default could be adjusted based on the amount of 
> memory available in the smallest node to determine whether we're concerned 
> about making higher order allocations. 

It's not a function of memory size; working set size is what is
important, or at least how many new pages have been allocated
recently. Fit your workload in physical memory and high orders are
great. Go larger than that and you hit problems. James' testing
indicated that kswapd CPU usage dropped to far lower levels with this
patch applied in his test of untarring a large file, for example.

> (Using the smallest node as a 
> metric so that mempolicies and cpusets don't get unfairly biased against.)  
> With the previous changes in this patchset, specifically avoiding waking 
> kswapd and doing compaction for the higher order allocs before falling 
> back to the min order, it shouldn't be devastating to try an order-3 alloc 
> that will fail quickly.
> 

Which is more reasonable? That an ordinary user gets a default that
is fairly safe even if benchmarks that demand the highest performance
from SLUB take a hit or that administrators running such workloads
set slub_max_order=3?

> > Signed-off-by: Mel Gorman <mgorman@suse.de>
> > ---
> >  Documentation/vm/slub.txt |    2 +-
> >  mm/slub.c                 |    2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/Documentation/vm/slub.txt b/Documentation/vm/slub.txt
> > index 07375e7..778e9fa 100644
> > --- a/Documentation/vm/slub.txt
> > +++ b/Documentation/vm/slub.txt
> > @@ -117,7 +117,7 @@ can be influenced by kernel parameters:
> >  
> >  slub_min_objects=x		(default 4)
> >  slub_min_order=x		(default 0)
> > -slub_max_order=x		(default 1)
> > +slub_max_order=x		(default 0)
> 
> Hmm, that was wrong to begin with, it should have been 3.
> 

True, but I didn't see the point of fixing it in a separate patch. If
this patch gets rejected, I'll submit a documentation fix.

> >  
> >  slub_min_objects allows to specify how many objects must at least fit
> >  into one slab in order for the allocation order to be acceptable.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 1071723..23a4789 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
> >   * take the list_lock.
> >   */
> >  static int slub_min_order;
> > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > +static int slub_max_order;
> >  static int slub_min_objects;
> >  
> >  /*

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/3] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-11 20:38   ` David Rientjes
@ 2011-05-11 21:10     ` Mel Gorman
  2011-05-12 17:25       ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2011-05-11 21:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, May 11, 2011 at 01:38:44PM -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
> 
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 9f8a97b..057f1e2 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1972,6 +1972,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  {
> >  	int alloc_flags = ALLOC_WMARK_MIN | ALLOC_CPUSET;
> >  	const gfp_t wait = gfp_mask & __GFP_WAIT;
> > +	const gfp_t can_wake_kswapd = !(gfp_mask & __GFP_NO_KSWAPD);
> >  
> >  	/* __GFP_HIGH is assumed to be the same as ALLOC_HIGH to save a branch. */
> >  	BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);
> > @@ -1984,7 +1985,7 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
> >  	 */
> >  	alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
> >  
> > -	if (!wait) {
> > +	if (!wait && can_wake_kswapd) {
> >  		/*
> >  		 * Not worth trying to allocate harder for
> >  		 * __GFP_NOMEMALLOC even if it can't schedule.
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 98c358d..1071723 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> >  	 * Let the initial higher-order allocation fail under memory pressure
> >  	 * so we fall-back to the minimum order allocation.
> >  	 */
> > -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) &
> > +			~(__GFP_NOFAIL | __GFP_WAIT);
> 
> __GFP_NORETRY is a no-op without __GFP_WAIT.
> 

True. I'll remove it in a V2 but I won't respin just yet.

> >  
> >  	page = alloc_slab_page(alloc_gfp, node, oo);
> >  	if (unlikely(!page)) {

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
  2011-05-11 15:29 [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations Mel Gorman
                   ` (2 preceding siblings ...)
  2011-05-11 15:29 ` [PATCH 3/3] mm: slub: Default slub_max_order to 0 Mel Gorman
@ 2011-05-11 21:39 ` James Bottomley
  2011-05-11 22:28   ` David Rientjes
  3 siblings, 1 reply; 77+ messages in thread
From: James Bottomley @ 2011-05-11 21:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Christoph Lameter, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Wed, 2011-05-11 at 16:29 +0100, Mel Gorman wrote:
> Debian (and probably Ubuntu) have recently changed to SLUB as the
> default allocator. There are a few reports of people experiencing hangs
> when copying large amounts of data with kswapd using a large amount of
> CPU. It appears this is down to SLUB using high orders by default and
> the page allocator and reclaim struggling to keep up. The following
> three patches reduce the cost of using those high orders.
> 
> Patch 1 prevents kswapd waking up in response to SLUB's speculative
> 	use of high orders. This eliminates the hangs and while the
> 	system can still stall for long periods, it recovers.
> 
> Patch 2 further reduces the cost by preventing SLUB from entering
> 	the direct compaction or reclaim paths on the grounds that falling
> 	back to order-0 should be cheaper.
> 
> Patch 3 defaults SLUB to using order-0 on the grounds that the
> 	systems that heavily benefit from using high-order are also
> 	sized to fit in physical memory. On such systems, they should
> 	manually tune slub_max_order=3.
> 
> My own data on this is not great. I haven't really been able to
> reproduce the same problem locally. A significant failing is that
> the tests weren't stressing X, and I couldn't make meaningful
> comparisons by just randomly clicking on things (working on fixing
> this problem).
> 
> The test case is simple. "download tar" wgets a large tar file and
> stores it locally. "unpack" is expanding it (15 times physical RAM
> in this case) and "delete source dirs" is the tarfile being deleted
> again. I also experimented with having the tar copied numerous times
> and into deeper directories to increase the size but the results were
> not particularly interesting so I left it as one tar.
> 
> Test server, 4 CPU threads (AMD Phenom), x86_64, 2G of RAM, no X running
>                              -       nowake    
>              largecopy-vanilla       kswapd-v1r1  noexstep-v1r1     default0-v1r1
> download tar           94 ( 0.00%)   94 ( 0.00%)   94 ( 0.00%)   93 ( 1.08%)
> unpack tar            521 ( 0.00%)  551 (-5.44%)  482 ( 8.09%)  488 ( 6.76%)
> delete source dirs    208 ( 0.00%)  218 (-4.59%)  194 ( 7.22%)  194 ( 7.22%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)        740.82    777.73    739.98    747.47
> Total Elapsed Time (seconds)               1046.66   1273.91    962.47    936.17
> 
> Disabling kswapd alone hurts performance slightly even though testers
> report it fixes hangs. I would guess it's because SLUB callers are
> calling direct reclaim more frequently (I belatedly noticed that
> compaction was disabled so it's not a factor) but haven't confirmed
> it. However, preventing kswapd waking or entering direct reclaim and
> having SLUB falling back to order-0 performed noticeably faster. Just
> using order-0 in the first place was fastest of all.
> 
> I tried running the same test on a test laptop but unfortunately
> due to a misconfiguration the results were lost. It would take a few
> hours to rerun so am posting without them.
> 
> If the testers verify this series help and we agree the patches are
> appropriate, they should be considered a stable candidate for 2.6.38.

OK, I confirm that I can't seem to break this one.  No hangs visible,
even when loading up the system with firefox, evolution, the usual
massive untar, X and even a distribution upgrade.

You can add my tested-by

James




* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-11 21:09     ` Mel Gorman
@ 2011-05-11 22:27       ` David Rientjes
  2011-05-13 10:14         ` Mel Gorman
  0 siblings, 1 reply; 77+ messages in thread
From: David Rientjes @ 2011-05-11 22:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, 11 May 2011, Mel Gorman wrote:

> I agree with you that there are situations where plenty of memory
> means that it'll perform much better. However, indications are
> that it breaks down with high CPU usage when memory is low.  Worse,
> once fragmentation becomes a problem, large amounts of UNMOVABLE and
> RECLAIMABLE will make it progressively more expensive to find the
> necessary pages. Perhaps with patches 1 and 2, this is not as much
> of a problem but figures in the leader indicated that for a simple
> workload with large amounts of files and data exceeding physical
> memory that it was better off not to use high orders at all which
> is a situation I'd expect to be encountered by more users than
> performance-sensitive applications.
> 
> In other words, we're taking one hit or the other.
> 

Seems like the ideal solution would then be to find how to best set the 
default, and that can probably only be done with the size of the smallest 
node since it has a higher likelihood of encountering a large amount of 
unreclaimable slab when memory is low.

> > I can get numbers for a simple netperf TCP_RR benchmark with this change 
> > applied to show the degradation on a server with >32GB of RAM with this 
> > patch applied.
> > 
> 
> Agreed, I'd expect netperf TCP_RR or TCP_STREAM to take a hit,
> particularly on a local machine where the recycling of pages will
> impact it heavily.
> 

Ignoring the local machine for a second, TCP_RR probably shouldn't be 
taking any more of a hit with slub than it already is.  When I benchmarked 
slab vs. slub a couple of months ago with two machines, each with four 
quad-core Opterons and 64GB of memory, this benchmark showed that slub was 
already 10-15% slower.  That's why slub has always been unusable for us, 
and I'm surprised that it's now becoming the favorite of distros 
everywhere (and, yes, Ubuntu now defaults to it as well).

> > It would be ideal if this default could be adjusted based on the amount of 
> > memory available in the smallest node to determine whether we're concerned 
> > about making higher order allocations. 
> 
> It's not a function of memory size, working set size is what
> is important or at least how many new pages have been allocated
> recently. Fit your workload in physical memory - high orders are
> great. Go larger than that and you hit problems. James' testing
> indicated that kswapd CPU usage dropped to far lower levels with this
> patch applied in his test of untarring a large file, for example.
> 

My point is that it would probably be better to tune the default based on 
how much memory is available at boot, since that implies the probability 
of having an abundance of memory while populating the caches' partial 
lists up to min_partial, rather than change it for everyone when it is 
known that this will cause performance degradations if memory is never 
low.  We probably don't want to be doing order-3 allocations for half the 
slab caches when we have 1G of memory available, but that's acceptable 
with 64GB.

> > (Using the smallest node as a 
> > metric so that mempolicies and cpusets don't get unfairly biased against.)  
> > With the previous changes in this patchset, specifically avoiding waking 
> > kswapd and doing compaction for the higher order allocs before falling 
> > back to the min order, it shouldn't be devastating to try an order-3 alloc 
> > that will fail quickly.
> > 
> 
> Which is more reasonable? That an ordinary user gets a default that
> is fairly safe even if benchmarks that demand the highest performance
> from SLUB take a hit or that administrators running such workloads
> set slub_max_order=3?
> 

Not sure what is more reasonable since it depends on what the workload is, 
but what probably is unreasonable is changing a slub default that is known 
to directly impact performance by presenting a single benchmark under 
consideration without some due diligence in testing others like netperf.

We all know that slub has some disadvantages compared to slab that are only 
now being realized because it has become the debian default, but it does 
excel at some workloads -- it was initially presented to beat slab in 
kernbench, hackbench, sysbench, and aim9 when it was merged.  Those 
advantages may never be fully realized on laptops or desktop machines, but 
on machines with plenty of memory available, slub often does perform 
better than slab.

That's why I suggested tuning the min order default based on total memory, 
it would probably be easier to justify than changing it for everyone and 
demanding users who are completely happy with using slub, the kernel.org 
default for years, now use command line options.


* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
  2011-05-11 21:39 ` [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations James Bottomley
@ 2011-05-11 22:28   ` David Rientjes
  2011-05-11 22:34     ` James Bottomley
  0 siblings, 1 reply; 77+ messages in thread
From: David Rientjes @ 2011-05-11 22:28 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, 11 May 2011, James Bottomley wrote:

> OK, I confirm that I can't seem to break this one.  No hangs visible,
> even when loading up the system with firefox, evolution, the usual
> massive untar, X and even a distribution upgrade.
> 
> You can add my tested-by
> 

Your system still hangs with patches 1 and 2 only?


* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
  2011-05-11 22:28   ` David Rientjes
@ 2011-05-11 22:34     ` James Bottomley
  2011-05-12 11:13       ` Pekka Enberg
  2011-05-12 18:04       ` Andrea Arcangeli
  0 siblings, 2 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-11 22:34 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote:
> On Wed, 11 May 2011, James Bottomley wrote:
> 
> > OK, I confirm that I can't seem to break this one.  No hangs visible,
> > even when loading up the system with firefox, evolution, the usual
> > massive untar, X and even a distribution upgrade.
> > 
> > You can add my tested-by
> > 
> 
> Your system still hangs with patches 1 and 2 only?

Yes, but only once in all the testing.  With patches 1 and 2 the hang is
much harder to reproduce, but it still seems to be present if I hit it
hard enough.

James




* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
  2011-05-11 22:34     ` James Bottomley
@ 2011-05-12 11:13       ` Pekka Enberg
  2011-05-12 13:19         ` Mel Gorman
  2011-05-12 14:04         ` James Bottomley
  2011-05-12 18:04       ` Andrea Arcangeli
  1 sibling, 2 replies; 77+ messages in thread
From: Pekka Enberg @ 2011-05-12 11:13 UTC (permalink / raw)
  To: James Bottomley
  Cc: David Rientjes, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On 5/12/11 1:34 AM, James Bottomley wrote:
> On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote:
>> On Wed, 11 May 2011, James Bottomley wrote:
>>
>>> OK, I confirm that I can't seem to break this one.  No hangs visible,
>>> even when loading up the system with firefox, evolution, the usual
>>> massive untar, X and even a distribution upgrade.
>>>
>>> You can add my tested-by
>>>
>> Your system still hangs with patches 1 and 2 only?
> Yes, but only once in all the testing.  With patches 1 and 2 the hang is
> much harder to reproduce, but it still seems to be present if I hit it
> hard enough.

Patches 1-2 look reasonable to me. I'm not completely convinced of patch 
3, though. Why are we seeing these problems now? This has been in 
mainline for a long time already. Shouldn't we fix kswapd?

                         Pekka


* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
  2011-05-12 11:13       ` Pekka Enberg
@ 2011-05-12 13:19         ` Mel Gorman
  2011-05-12 14:04         ` James Bottomley
  1 sibling, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-12 13:19 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: James Bottomley, David Rientjes, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, May 12, 2011 at 02:13:44PM +0300, Pekka Enberg wrote:
> On 5/12/11 1:34 AM, James Bottomley wrote:
> >On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote:
> >>On Wed, 11 May 2011, James Bottomley wrote:
> >>
> >>>OK, I confirm that I can't seem to break this one.  No hangs visible,
> >>>even when loading up the system with firefox, evolution, the usual
> >>>massive untar, X and even a distribution upgrade.
> >>>
> >>>You can add my tested-by
> >>>
> >>Your system still hangs with patches 1 and 2 only?
> >Yes, but only once in all the testing.  With patches 1 and 2 the hang is
> >much harder to reproduce, but it still seems to be present if I hit it
> >hard enough.
> 
> Patches 1-2 look reasonable to me. I'm not completely convinced of
> patch 3, though. Why are we seeing these problems now?

I'm not certain; testing so far has only been able to point to the change
from SLAB to SLUB between 2.6.37 and 2.6.38. This probably boils down to
distributions changing their allocator from slab to slub as recommended by
Kconfig and SLUB being tested heavily on desktop workloads in a variety of
settings for the first time. It's worth noting that only a few users have
been able to reproduce this. I don't see the severe hangs for example during
tests meaning it might also be down to newer hardware. What may be required
to reproduce this is many CPUs (4 on the test machines) with relatively
low memory for a 4-CPU machine (2G) and a slower disk than people might
have tested with up until now.

There are other new considerations as well that weren't much of a factor
when SLUB came along. The first reproduction case involved ext4, for
example, which does delayed block allocation. It's possible there is some
problem whereby all the dirty pages to be written to disk need blocks to
be allocated and GFP_NOFS is not being used properly. Instead of failing
the high-order allocation, we then block instead hanging direct reclaimers
and kswapd. The filesystem people looked at this bug but didn't mention if
something like this was a possibility.


* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
  2011-05-12 11:13       ` Pekka Enberg
  2011-05-12 13:19         ` Mel Gorman
@ 2011-05-12 14:04         ` James Bottomley
  2011-05-12 15:53           ` James Bottomley
  1 sibling, 1 reply; 77+ messages in thread
From: James Bottomley @ 2011-05-12 14:04 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, 2011-05-12 at 14:13 +0300, Pekka Enberg wrote:
> On 5/12/11 1:34 AM, James Bottomley wrote:
> > On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote:
> >> On Wed, 11 May 2011, James Bottomley wrote:
> >>
> >>> OK, I confirm that I can't seem to break this one.  No hangs visible,
> >>> even when loading up the system with firefox, evolution, the usual
> >>> massive untar, X and even a distribution upgrade.
> >>>
> >>> You can add my tested-by
> >>>
> >> Your system still hangs with patches 1 and 2 only?
> > Yes, but only once in all the testing.  With patches 1 and 2 the hang is
> > much harder to reproduce, but it still seems to be present if I hit it
> > hard enough.
> 
> Patches 1-2 look reasonable to me. I'm not completely convinced of patch 
> 3, though. Why are we seeing these problems now? This has been in 
> mainline for a long time already. Shouldn't we fix kswapd?

So I'm open to this.  The hang occurs when kswapd races around in
shrink_slab and never exits.  It looks like there's a massive number of
wakeups triggering this, but we haven't been able to diagnose it
further.  Turning on PREEMPT gets rid of the hang, so I could try to
reproduce with PREEMPT and turn on tracing.  The problem so far has been
that the number of events is so huge that the trace buffer only captures
a few microseconds of output.
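A tracefs configuration sketch of the kind of setup described above (paths assume debugfs is mounted at /sys/kernel/debug; the buffer size is an arbitrary guess, not a value from this thread):

```shell
cd /sys/kernel/debug/tracing
echo 0 > tracing_on            # stop tracing while configuring
echo > set_event               # clear any previously enabled events
echo 'vmscan:*' > set_event    # enable only the vmscan tracepoints
echo 65536 > buffer_size_kb    # enlarge the per-CPU ring buffer
echo 1 > tracing_on
# ... reproduce the kswapd spin, then save the capture:
cat trace > /tmp/kswapd.trace
```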

James




* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-11 15:29 ` [PATCH 3/3] mm: slub: Default slub_max_order to 0 Mel Gorman
  2011-05-11 20:38   ` David Rientjes
@ 2011-05-12 14:43   ` Christoph Lameter
  2011-05-12 15:15     ` James Bottomley
  1 sibling, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2011-05-12 14:43 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Wed, 11 May 2011, Mel Gorman wrote:

> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
>   * take the list_lock.
>   */
>  static int slub_min_order;
> -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> +static int slub_max_order;

If we really need to do this then do not push this down to zero, please.
SLAB uses order 1 for the max. Let's at least keep it there.

We have been using SLUB for a long time. Why is this issue arising now?
Due to compaction etc making reclaim less efficient?



* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 14:43   ` Christoph Lameter
@ 2011-05-12 15:15     ` James Bottomley
  2011-05-12 15:27       ` Christoph Lameter
  0 siblings, 1 reply; 77+ messages in thread
From: James Bottomley @ 2011-05-12 15:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 2011-05-12 at 09:43 -0500, Christoph Lameter wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
> 
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2198,7 +2198,7 @@ EXPORT_SYMBOL(kmem_cache_free);
> >   * take the list_lock.
> >   */
> >  static int slub_min_order;
> > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > +static int slub_max_order;
> 
> If we really need to do this then do not push this down to zero, please.
> SLAB uses order 1 for the max. Let's at least keep it there.

1 is the current value.  Reducing it to zero seems to fix the kswapd
induced hangs.  The problem does look to be some shrinker/allocator
interference somewhere in vmscan.c, but the fact is that it's triggered
by SLUB and not SLAB.  I really think that what's happening is some type
of feedback loops where one of the shrinkers is issuing a
wakeup_kswapd() so kswapd never sleeps (and never relinquishes the CPU
on non-preempt).

> We have been using SLUB for a long time. Why is this issue arising now?
> Due to compaction etc making reclaim less efficient?

This is the snark argument (I've said it thrice the bellman cried and
what I tell you three times is true).  The fact is that no enterprise
distribution at all uses SLUB.  It's only recently that the desktop
distributions started to ... the bugs are showing up under FC15 beta,
which is the first fedora distribution to enable it.  I'd say we're only
just beginning widespread SLUB testing.

James




* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 15:15     ` James Bottomley
@ 2011-05-12 15:27       ` Christoph Lameter
  2011-05-12 15:43         ` James Bottomley
  2011-05-12 15:45         ` Dave Jones
  0 siblings, 2 replies; 77+ messages in thread
From: Christoph Lameter @ 2011-05-12 15:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 12 May 2011, James Bottomley wrote:

> > >   */
> > >  static int slub_min_order;
> > > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > > +static int slub_max_order;
> >
> > If we really need to do this then do not push this down to zero, please.
> > SLAB uses order 1 for the max. Let's at least keep it there.
>
> 1 is the current value.  Reducing it to zero seems to fix the kswapd
> induced hangs.  The problem does look to be some shrinker/allocator
> interference somewhere in vmscan.c, but the fact is that it's triggered
> by SLUB and not SLAB.  I really think that what's happening is some type
> of feedback loops where one of the shrinkers is issuing a
> wakeup_kswapd() so kswapd never sleeps (and never relinquishes the CPU
> on non-preempt).

The current value is PAGE_ALLOC_COSTLY_ORDER which is 3.

> > We have been using SLUB for a long time. Why is this issue arising now?
> > Due to compaction etc making reclaim less efficient?
>
> This is the snark argument (I've said it thrice the bellman cried and
> what I tell you three times is true).  The fact is that no enterprise
> distribution at all uses SLUB.  It's only recently that the desktop
> distributions started to ... the bugs are showing up under FC15 beta,
> which is the first fedora distribution to enable it.  I'd say we're only
> just beginning widespread SLUB testing.

Debian and Ubuntu have been using SLUB for a long time (and AFAICT from my
archives so has Fedora). I have been running those here for a couple of
years and the issues that I see here seem to be only with the most
recent kernels that now do compaction and other reclaim tricks.







* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 15:27       ` Christoph Lameter
@ 2011-05-12 15:43         ` James Bottomley
  2011-05-12 15:46           ` Dave Jones
                             ` (2 more replies)
  2011-05-12 15:45         ` Dave Jones
  1 sibling, 3 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-12 15:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 2011-05-12 at 10:27 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> 
> > > >   */
> > > >  static int slub_min_order;
> > > > -static int slub_max_order = PAGE_ALLOC_COSTLY_ORDER;
> > > > +static int slub_max_order;
> > >
> > > If we really need to do this then do not push this down to zero, please.
> > > SLAB uses order 1 for the max. Let's at least keep it there.
> >
> > 1 is the current value.  Reducing it to zero seems to fix the kswapd
> > induced hangs.  The problem does look to be some shrinker/allocator
> > interference somewhere in vmscan.c, but the fact is that it's triggered
> > by SLUB and not SLAB.  I really think that what's happening is some type
> > of feedback loops where one of the shrinkers is issuing a
> > wakeup_kswapd() so kswapd never sleeps (and never relinquishes the CPU
> > on non-preempt).
> 
> The current value is PAGE_ALLOC_COSTLY_ORDER which is 3.
> 
> > > We have been using SLUB for a long time. Why is this issue arising now?
> > > Due to compaction etc making reclaim less efficient?
> >
> > This is the snark argument (I've said it thrice the bellman cried and
> > what I tell you three times is true).  The fact is that no enterprise
> > distribution at all uses SLUB.  It's only recently that the desktop
> > distributions started to ... the bugs are showing up under FC15 beta,
> > which is the first fedora distribution to enable it.  I'd say we're only
> > just beginning widespread SLUB testing.
> 
> Debian and Ubuntu have been using SLUB for a long time

Only from Squeeze, which has been released for ~3 months.  That doesn't
qualify as a "long time" in my book.

>  (and AFAICT from my
> archives so has Fedora).

As I said above, no released fedora version uses SLUB.  It's only just
been enabled for the unreleased FC15; I'm testing a beta copy.

>  I have been running those here for a couple of
> years and the issues that I see here seem to be only with the most
> recent kernels that now do compaction and other reclaim tricks.

but a sample of one doeth not great testing make.

However, since you admit even you see problems, let's concentrate on
fixing them rather than recriminations?

James




* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 15:27       ` Christoph Lameter
  2011-05-12 15:43         ` James Bottomley
@ 2011-05-12 15:45         ` Dave Jones
  1 sibling, 0 replies; 77+ messages in thread
From: Dave Jones @ 2011-05-12 15:45 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: James Bottomley, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, May 12, 2011 at 10:27:00AM -0500, Christoph Lameter wrote:
 > On Thu, 12 May 2011, James Bottomley wrote:
 > > It's only recently that the desktop
 > > distributions started to ... the bugs are showing up under FC15 beta,
 > > which is the first fedora distribution to enable it.  I'd say we're only
 > > just beginning widespread SLUB testing.
 > 
 > Debian and Ubuntu have been using SLUB for a long time (and AFAICT from my
 > archives so has Fedora).

Indeed. It was enabled in Fedora pretty much as soon as it appeared in mainline.

	Dave



* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 15:43         ` James Bottomley
@ 2011-05-12 15:46           ` Dave Jones
  2011-05-12 16:00             ` James Bottomley
  2011-05-12 15:55           ` Pekka Enberg
  2011-05-12 16:01           ` Christoph Lameter
  2 siblings, 1 reply; 77+ messages in thread
From: Dave Jones @ 2011-05-12 15:46 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote:

 > As I said above, no released fedora version uses SLUB.  It's only just
 > been enabled for the unreleased FC15; I'm testing a beta copy.

James, this isn't true.

$ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64 
CONFIG_SLUB_DEBUG=y
CONFIG_SLUB=y

(That's the oldest release I have right now, but it's been enabled even
before that release).

	Dave


* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
  2011-05-12 14:04         ` James Bottomley
@ 2011-05-12 15:53           ` James Bottomley
  2011-05-13 11:25             ` Mel Gorman
  0 siblings, 1 reply; 77+ messages in thread
From: James Bottomley @ 2011-05-12 15:53 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: David Rientjes, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 1736 bytes --]

On Thu, 2011-05-12 at 09:04 -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 14:13 +0300, Pekka Enberg wrote:
> > On 5/12/11 1:34 AM, James Bottomley wrote:
> > > On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote:
> > >> On Wed, 11 May 2011, James Bottomley wrote:
> > >>
> > >>> OK, I confirm that I can't seem to break this one.  No hangs visible,
> > >>> even when loading up the system with firefox, evolution, the usual
> > >>> massive untar, X and even a distribution upgrade.
> > >>>
> > >>> You can add my tested-by
> > >>>
> > >> Your system still hangs with patches 1 and 2 only?
> > > Yes, but only once in all the testing.  With patches 1 and 2 the hang is
> > > much harder to reproduce, but it still seems to be present if I hit it
> > > hard enough.
> > 
> > Patches 1-2 look reasonable to me. I'm not completely convinced of patch 
> > 3, though. Why are we seeing these problems now? This has been in 
> > mainline for a long time already. Shouldn't we fix kswapd?
> 
> So I'm open to this.  The hang occurs when kswapd races around in
> shrink_slab and never exits.  It looks like there's a massive number of
> wakeups triggering this, but we haven't been able to diagnose it
> further.  turning on PREEMPT gets rid of the hang, so I could try to
> reproduce with PREEMPT and turn on tracing.  The problem so far has been
> that the number of events is so huge that the trace buffer only captures
> a few microseconds of output.

OK, here's the trace from a PREEMPT kernel (2.6.38.6) when kswapd hits
99% and stays there.  I've only enabled the vmscan tracepoints to try
and get a longer run.  It mostly looks like kswapd waking itself, but
there might be more in there that mm trained eyes can see.

James


[-- Attachment #2: tmp.trace.gz --]
[-- Type: application/x-gzip, Size: 175858 bytes --]


* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 15:43         ` James Bottomley
  2011-05-12 15:46           ` Dave Jones
@ 2011-05-12 15:55           ` Pekka Enberg
  2011-05-12 18:37             ` James Bottomley
  2011-05-12 16:01           ` Christoph Lameter
  2 siblings, 1 reply; 77+ messages in thread
From: Pekka Enberg @ 2011-05-12 15:55 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> However, since you admit even you see problems, let's concentrate on
> fixing them rather than recriminations?

Yes, please. So does dropping max_order to 1 help?
PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
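For anyone wanting to try intermediate values without rebuilding, a boot-line configuration sketch (the kmalloc-2048 cache name is just an example):

```shell
# On the kernel command line:
#   slub_max_order=1    # the order-1 cap suggested above
#   slub_max_order=0    # patch 3's proposed order-0 default
# The effective per-cache order can be inspected at runtime:
cat /sys/kernel/slab/kmalloc-2048/order
```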

			Pekka



* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 15:46           ` Dave Jones
@ 2011-05-12 16:00             ` James Bottomley
  2011-05-12 16:08               ` Dave Jones
  2011-05-12 16:27               ` Christoph Lameter
  0 siblings, 2 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-12 16:00 UTC (permalink / raw)
  To: Dave Jones
  Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, 2011-05-12 at 11:46 -0400, Dave Jones wrote:
> On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote:
> 
>  > As I said above, no released fedora version uses SLUB.  It's only just
>  > been enabled for the unreleased FC15; I'm testing a beta copy.
> 
> James, this isn't true.
> 
> $ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64 
> CONFIG_SLUB_DEBUG=y
> CONFIG_SLUB=y
> 
> (That's the oldest release I have right now, but it's been enabled even
> before that release).

OK, I concede the point ... I haven't actually kept any of my FC
machines current for a while.

However, the fact remains that this seems to be a slub problem and it
needs fixing.

James



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 15:43         ` James Bottomley
  2011-05-12 15:46           ` Dave Jones
  2011-05-12 15:55           ` Pekka Enberg
@ 2011-05-12 16:01           ` Christoph Lameter
  2011-05-12 16:10             ` Eric Dumazet
  2 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2011-05-12 16:01 UTC (permalink / raw)
  To: James Bottomley
  Cc: Mel Gorman, Andrew Morton, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Pekka Enberg, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 12 May 2011, James Bottomley wrote:

> > Debian and Ubuntu have been using SLUB for a long time
>
> Only from Squeeze, which has been released for ~3 months.  That doesn't
> qualify as a "long time" in my book.

I am sorry, but I have never used a Debian/Ubuntu system in the last 3
years that did not use SLUB, and that was the default. But then we
usually do not run the "released" Debian version; typically one runs
testing. Ubuntu is different: there we usually run releases, but those
have been SLUB for as long as I can remember.

And so far it is rock solid and is widely rolled out throughout our
infrastructure (mostly 2.6.32 kernels).

> but a sample of one doeth not great testing make.
>
> However, since you admit even you see problems, let's concentrate on
> fixing them rather than recriminations?

I do not see problems here with earlier kernels. I only see these on one
testing system with the latest kernels on Ubuntu 11.04.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 16:00             ` James Bottomley
@ 2011-05-12 16:08               ` Dave Jones
  2011-05-12 16:27               ` Christoph Lameter
  1 sibling, 0 replies; 77+ messages in thread
From: Dave Jones @ 2011-05-12 16:08 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, May 12, 2011 at 11:00:23AM -0500, James Bottomley wrote:
 > On Thu, 2011-05-12 at 11:46 -0400, Dave Jones wrote:
 > > On Thu, May 12, 2011 at 10:43:13AM -0500, James Bottomley wrote:
 > > 
 > >  > As I said above, no released fedora version uses SLUB.  It's only just
 > >  > been enabled for the unreleased FC15; I'm testing a beta copy.
 > > 
 > > James, this isn't true.
 > > 
 > > $ grep SLUB /boot/config-2.6.35.12-88.fc14.x86_64 
 > > CONFIG_SLUB_DEBUG=y
 > > CONFIG_SLUB=y
 > > 
 > > (That's the oldest release I have right now, but it's been enabled even
 > > before that release).
 > 
 > OK, I concede the point ... I haven't actually kept any of my FC
 > machines current for a while.

'a while' is an understatement :)
It was first enabled in Fedora 8 in 2007.

	Dave
 

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 16:01           ` Christoph Lameter
@ 2011-05-12 16:10             ` Eric Dumazet
  2011-05-12 17:37               ` Andrew Morton
  0 siblings, 1 reply; 77+ messages in thread
From: Eric Dumazet @ 2011-05-12 16:10 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: James Bottomley, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thursday, May 12, 2011 at 11:01 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> 
> > > Debian and Ubuntu have been using SLUB for a long time
> >
> > Only from Squeeze, which has been released for ~3 months.  That doesn't
> > qualify as a "long time" in my book.
> 
> I am sorry, but I have never used a Debian/Ubuntu system in the last 3
> years that did not use SLUB, and that was the default. But then we
> usually do not run the "released" Debian version; typically one runs
> testing. Ubuntu is different: there we usually run releases, but those
> have been SLUB for as long as I can remember.
> 
> And so far it is rock solid and is widely rolled out throughout our
> infrastructure (mostly 2.6.32 kernels).
> 
> > but a sample of one doeth not great testing make.
> >
> > However, since you admit even you see problems, let's concentrate on
> > fixing them rather than recriminations?
> 
> I do not see problems here with earlier kernels. I only see these on one
> testing system with the latest kernels on Ubuntu 11.04.

More fuel to this discussion with commit 6d4831c2

Something is wrong with high order allocations, on some machines.

Maybe we can find the real cause instead of limiting ourselves to
order-0 pages in the end... ;)

commit 6d4831c283530a5f2c6bd8172c13efa236eb149d
Author: Andrew Morton <akpm@linux-foundation.org>
Date:   Wed Apr 27 15:26:41 2011 -0700

    vfs: avoid large kmalloc()s for the fdtable
    
    Azurit reports large increases in system time after 2.6.36 when running
    Apache.  It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
    to allocate fdmem if possible").
    
    That patch caused the vfs to use kmalloc() for very large allocations and
    this is causing excessive work (and presumably excessive reclaim) within
    the page allocator.
    
    Fix it by falling back to vmalloc() earlier - when the allocation attempt
    would have been considered "costly" by reclaim.
    
  


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 16:00             ` James Bottomley
  2011-05-12 16:08               ` Dave Jones
@ 2011-05-12 16:27               ` Christoph Lameter
  2011-05-12 16:30                 ` James Bottomley
  2011-05-12 17:40                 ` Andrea Arcangeli
  1 sibling, 2 replies; 77+ messages in thread
From: Christoph Lameter @ 2011-05-12 16:27 UTC (permalink / raw)
  To: James Bottomley
  Cc: Dave Jones, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, 12 May 2011, James Bottomley wrote:

> However, the fact remains that this seems to be a slub problem and it
> needs fixing.

Why are you so fixed on slub in these matters? It's a key component, but
there is high interaction with other subsystems. There was no recent
change in slub that changed the order of allocations. There were changes
affecting the reclaim logic. Slub has been working just fine with the
existing allocation schemes for a long time.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 16:27               ` Christoph Lameter
@ 2011-05-12 16:30                 ` James Bottomley
  2011-05-12 16:48                   ` Christoph Lameter
  2011-05-12 17:06                   ` Pekka Enberg
  2011-05-12 17:40                 ` Andrea Arcangeli
  1 sibling, 2 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-12 16:30 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Dave Jones, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, 2011-05-12 at 11:27 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> 
> > However, the fact remains that this seems to be a slub problem and it
> > needs fixing.
> 
> Why are you so fixed on slub in these matters?

Because, as has been hashed out in the thread, changing SLUB to SLAB
makes the hang go away.

>  It's a key component, but
> there is high interaction with other subsystems. There was no recent
> change in slub that changed the order of allocations. There were changes
> affecting the reclaim logic. Slub has been working just fine with the
> existing allocation schemes for a long time.

So suggest an alternative root cause and a test to expose it.

James



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 16:30                 ` James Bottomley
@ 2011-05-12 16:48                   ` Christoph Lameter
  2011-05-12 17:46                     ` Andrea Arcangeli
  2011-05-12 17:06                   ` Pekka Enberg
  1 sibling, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2011-05-12 16:48 UTC (permalink / raw)
  To: James Bottomley
  Cc: Dave Jones, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, 12 May 2011, James Bottomley wrote:

> On Thu, 2011-05-12 at 11:27 -0500, Christoph Lameter wrote:
> > On Thu, 12 May 2011, James Bottomley wrote:
> >
> > > However, the fact remains that this seems to be a slub problem and it
> > > needs fixing.
> >
> > Why are you so fixed on slub in these matters?
>
> Because, as has been hashed out in the thread, changing SLUB to SLAB
> makes the hang go away.

SLUB doesn't hang here with earlier kernel versions either. So the higher-order
allocations are no longer as effective as they were before. This is due to
a change in another subsystem.

> >  It's a key component, but
> > there is high interaction with other subsystems. There was no recent
> > change in slub that changed the order of allocations. There were changes
> > affecting the reclaim logic. Slub has been working just fine with the
> > existing allocation schemes for a long time.
>
> So suggest an alternative root cause and a test to expose it.

Have a look at my other emails? I am just repeating myself again it seems.

Try order = 1, which gives you SLAB-like interaction with the page
allocator. Then we at least know whether it is the order-2 and order-3 allocs
that are the problem and not something else.



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 16:30                 ` James Bottomley
  2011-05-12 16:48                   ` Christoph Lameter
@ 2011-05-12 17:06                   ` Pekka Enberg
  2011-05-12 17:11                     ` Pekka Enberg
  1 sibling, 1 reply; 77+ messages in thread
From: Pekka Enberg @ 2011-05-12 17:06 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Lameter, Dave Jones, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, May 12, 2011 at 7:30 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> So suggest an alternative root cause and a test to expose it.

Is your .config available somewhere, btw?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 17:06                   ` Pekka Enberg
@ 2011-05-12 17:11                     ` Pekka Enberg
  2011-05-12 17:38                       ` Christoph Lameter
                                         ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Pekka Enberg @ 2011-05-12 17:11 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Lameter, Dave Jones, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, Andrea Arcangeli

On Thu, May 12, 2011 at 8:06 PM, Pekka Enberg <penberg@kernel.org> wrote:
> On Thu, May 12, 2011 at 7:30 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
>> So suggest an alternative root cause and a test to expose it.
>
> Is your .config available somewhere, btw?

If it's this:

http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD

I'd love to see what happens if you disable

CONFIG_TRANSPARENT_HUGEPAGE=y

because that's going to reduce high order allocations as well, no?

                        Pekka

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 2/3] mm: slub: Do not take expensive steps for SLUBs speculative high-order allocations
  2011-05-11 21:10     ` Mel Gorman
@ 2011-05-12 17:25       ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-12 17:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

Hi,

On Wed, May 11, 2011 at 10:10:43PM +0100, Mel Gorman wrote:
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > index 98c358d..1071723 100644
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -1170,7 +1170,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
> > >  	 * Let the initial higher-order allocation fail under memory pressure
> > >  	 * so we fall-back to the minimum order allocation.
> > >  	 */
> > > -	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) & ~__GFP_NOFAIL;
> > > +	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY | __GFP_NO_KSWAPD) &
> > > +			~(__GFP_NOFAIL | __GFP_WAIT);
> > 
> > __GFP_NORETRY is a no-op without __GFP_WAIT.
> > 
> 
> True. I'll remove it in a V2 but I won't respin just yet.

Nothing wrong and no performance difference with clearing
__GFP_NORETRY too; if anything, it doesn't make sense for a caller to
use __GFP_NOFAIL without __GFP_WAIT, so the original version above
looks cleaner. I like this change overall: it only polls the buddy
allocator without spinning kswapd and without invoking lumpy reclaim.

Like you noted in the first mail, compaction was disabled, and very
bad behavior is expected without it unless GFP_ATOMIC|__GFP_NO_KSWAPD
is set (that was the approach I had to use before disabling lumpy
reclaim when first developing THP too, for the same reasons).

But with compaction enabled, slub could try to clear only __GFP_NOFAIL
and leave __GFP_WAIT, and no bad behavior should happen... but it's
probably slower, so I prefer to clear __GFP_WAIT too (for THP
compaction is worth it because the allocation is generally long-lived,
but for slub allocations like tiny skbs the allocation can be extremely
short-lived, so it's unlikely to be worth it). This way compaction
is then invoked only by the minimal-order allocation later, if needed.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-11 20:38   ` David Rientjes
  2011-05-11 20:53     ` James Bottomley
  2011-05-11 21:09     ` Mel Gorman
@ 2011-05-12 17:36     ` Andrea Arcangeli
  2011-05-16 21:03       ` David Rientjes
  2 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-12 17:36 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> kswapd and doing compaction for the higher order allocs before falling 

Note that patch 2 disabled compaction by clearing __GFP_WAIT.

What you describe here would be patch 2 without the ~__GFP_WAIT
addition (so keeping only ~GFP_NOFAIL).

Not clearing __GFP_WAIT when compaction is enabled is possible and
shouldn't result in bad behavior (if compaction is not enabled with
current SLUB it's hard to imagine how it could perform decently if
there's fragmentation). You should try to benchmark to see if it's
worth it on the large NUMA systems with heavy network traffic (for
normal systems I doubt compaction is worth it but I'm not against
trying to keep it enabled just in case).

On a side note, this reminds me to rebuild with slub_max_order in .bss
on my cellphone (where I can't switch to SLAB because of some silly
rfs vfat-on-steroids proprietary module).

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 16:10             ` Eric Dumazet
@ 2011-05-12 17:37               ` Andrew Morton
  0 siblings, 0 replies; 77+ messages in thread
From: Andrew Morton @ 2011-05-12 17:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Lameter, James Bottomley, Mel Gorman, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, 12 May 2011 18:10:38 +0200 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> More fuel to this discussion with commit 6d4831c2
> 
> Something is wrong with high order allocations, on some machines.
> 
> Maybe we can find the real cause instead of limiting ourselves to
> order-0 pages in the end... ;)
> 
> commit 6d4831c283530a5f2c6bd8172c13efa236eb149d
> Author: Andrew Morton <akpm@linux-foundation.org>
> Date:   Wed Apr 27 15:26:41 2011 -0700
> 
>     vfs: avoid large kmalloc()s for the fdtable

Well, it's always been the case that satisfying higher-order
allocations takes a disproportionate amount of work in page reclaim,
and often causes excessive reclaim.

That's why we've traditionally worked to avoid higher-order
allocations, and this has always been a problem with slub.

But the higher-order allocations shouldn't cause the VM to melt down. 
We changed something, and now it melts down.  Changing slub to avoid
that meltdown doesn't fix the thing we broke.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 17:11                     ` Pekka Enberg
@ 2011-05-12 17:38                       ` Christoph Lameter
  2011-05-12 18:00                         ` Andrea Arcangeli
  2011-05-12 17:51                       ` Andrea Arcangeli
  2011-05-12 18:36                       ` James Bottomley
  2 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2011-05-12 17:38 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: James Bottomley, Dave Jones, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, Andrea Arcangeli

On Thu, 12 May 2011, Pekka Enberg wrote:

> On Thu, May 12, 2011 at 8:06 PM, Pekka Enberg <penberg@kernel.org> wrote:
> > On Thu, May 12, 2011 at 7:30 PM, James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> >> So suggest an alternative root cause and a test to expose it.
> >
> > Is your .config available somewhere, btw?
>
> If it's this:
>
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD
>
> I'd love to see what happens if you disable
>
> CONFIG_TRANSPARENT_HUGEPAGE=y
>
> because that's going to reduce high order allocations as well, no?

I don't think that will change much since huge pages are at MAX_ORDER size.
Either you can get them or not. The challenge with the small-order
allocations is that they require contiguous memory. Compaction is likely
not as effective as the prior mechanism that did opportunistic reclaim of
neighboring pages.


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 16:27               ` Christoph Lameter
  2011-05-12 16:30                 ` James Bottomley
@ 2011-05-12 17:40                 ` Andrea Arcangeli
  1 sibling, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-12 17:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: James Bottomley, Dave Jones, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 11:27:04AM -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> 
> > However, the fact remains that this seems to be a slub problem and it
> > needs fixing.
> 
> Why are you so fixed on slub in these matters? It's a key component, but
> there is high interaction with other subsystems. There was no recent
> change in slub that changed the order of allocations. There were changes
> affecting the reclaim logic. Slub has been working just fine with the
> existing allocation schemes for a long time.

It should work just fine when compaction is enabled.

The COMPACTION=n case would also work decently if we eliminated lumpy
reclaim. Lumpy reclaim tells the VM to ignore all young bits in the
pagetables and take everything down in order to generate the order-3
page that SLUB asks for. You can't expect decent behavior the moment you
take everything down regardless of referenced bits on pages and young
bits in ptes. I doubt it's a new issue, but lumpy may have become more or
less aggressive over time. The good thing is that lumpy is eliminated
(basically at runtime, not compile time) by enabling compaction.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 16:48                   ` Christoph Lameter
@ 2011-05-12 17:46                     ` Andrea Arcangeli
  2011-05-12 18:00                       ` Christoph Lameter
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-12 17:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: James Bottomley, Dave Jones, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 11:48:19AM -0500, Christoph Lameter wrote:
> Try order = 1 which gives you SLAB like interaction with the page
> allocator. Then we at least know that it is the order 2 and 3 allocs that
> are the problem and not something else.

Order 1 should work better, because it's less likely we end up here
(which leaves RECLAIM_MODE_LUMPYRECLAIM on; then see what happens
at the top of page_check_references()):

   else if (sc->order && priority < DEF_PRIORITY - 2)
   	sc->reclaim_mode |= syncmode;

with order 1 it's more likely we end up here, as enough pages are freed for
order 1 and we're safe:

     else
	sc->reclaim_mode = RECLAIM_MODE_SINGLE | RECLAIM_MODE_ASYNC;

None of these issues should materialize with COMPACTION=y. Even
__GFP_WAIT can be left enabled to run compaction without expecting
adverse behavior, but running compaction may still not be worth it for
small systems, where the benefit of having order 1/2/3 allocations may
not outweigh the cost of compaction itself.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 17:11                     ` Pekka Enberg
  2011-05-12 17:38                       ` Christoph Lameter
@ 2011-05-12 17:51                       ` Andrea Arcangeli
  2011-05-12 18:03                         ` Christoph Lameter
  2011-05-12 18:36                       ` James Bottomley
  2 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-12 17:51 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: James Bottomley, Christoph Lameter, Dave Jones, Mel Gorman,
	Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 08:11:05PM +0300, Pekka Enberg wrote:
> If it's this:
> 
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD
> 
> I'd love to see what happens if you disable
> 
> CONFIG_TRANSPARENT_HUGEPAGE=y
> 
> because that's going to reduce high order allocations as well, no?

Well, THP forces COMPACTION=y so lumpy won't risk being activated. I
once got a complaint asking not to make THP force COMPACTION=y (there
is no real dependency here, THP will just call alloc_pages with
__GFP_NO_KSWAPD and order 9, or 10 on x86-nopae), but I preferred to
keep it forced exactly to avoid issues like these when THP is on. If
even order 3 is causing trouble (which doesn't immediately activate
lumpy; it only activates when priority is < DEF_PRIORITY-2, so
after 2 loops failing to reclaim nr_to_reclaim pages), imagine what
was happening at order 9 every time firefox, gcc and mutt allocated
memory ;).

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 17:46                     ` Andrea Arcangeli
@ 2011-05-12 18:00                       ` Christoph Lameter
  2011-05-12 18:18                         ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2011-05-12 18:00 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: James Bottomley, Dave Jones, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, 12 May 2011, Andrea Arcangeli wrote:

> order 1 should work better, because it's less likely we end up here
> (which leaves RECLAIM_MODE_LUMPYRECLAIM on and then see what happens
> at the top of page_check_references())
>
>    else if (sc->order && priority < DEF_PRIORITY - 2)

Why is this DEF_PRIORITY - 2? Shouldn't it be DEF_PRIORITY? An accommodation
for SLAB order-1 allocs?

May I assume that the case of order-2 and order-3 allocs was not
very well tested after the changes that introduced compaction, since people
were focusing on RHEL testing?


^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 17:38                       ` Christoph Lameter
@ 2011-05-12 18:00                         ` Andrea Arcangeli
  2011-05-13  9:49                           ` Mel Gorman
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-12 18:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, James Bottomley, Dave Jones, Mel Gorman,
	Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 12:38:34PM -0500, Christoph Lameter wrote:
> I don't think that will change much since huge pages are at MAX_ORDER size.
> Either you can get them or not. The challenge with the small order
> allocations is that they require contiguous memory. Compaction is likely
> not as effective as the prior mechanism that did opportunistic reclaim of
> neighboring pages.

THP requires contiguous pages too; the issue is similar, and worse
with THP, but THP enables compaction by default, so likely this only
happens with compaction off. We've really got to differentiate between
compaction on and off; it makes a world of difference (a THP-enabled
kernel with compaction off also runs into swap storms and temporary
hangs all the time, it's probably the same issue as SLUB=y
COMPACTION=n). At least THP didn't activate kswapd; kswapd running
lumpy too makes things worse, as it'll probably keep running in the
background after the direct reclaim fails.

The original report talks about kernels with SLUB=y and
COMPACTION=n. Not sure if anybody is having trouble with SLUB=y
COMPACTION=y...

Compaction is also more effective than the prior mechanism (the prior
mechanism being lumpy reclaim) and it doesn't cause VM disruptions that
ignore all referenced information and take down anything they find in
the way.

I think when COMPACTION=n, lumpy should either go away or only be
activated by __GFP_REPEAT so that only hugetlbfs makes use of
it. Halting the system for a while is ok when increasing nr_hugepages, but
when all allocations are doing that, the system becomes unusable, kind of
livelocked.

BTW, it comes to mind that in patch 2, SLUB should clear __GFP_REPEAT too
(not only __GFP_NOFAIL). Clearing __GFP_WAIT may or may not be worth it
with COMPACTION=y; it's definitely a good idea to clear __GFP_WAIT unless
lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 17:51                       ` Andrea Arcangeli
@ 2011-05-12 18:03                         ` Christoph Lameter
  2011-05-12 18:09                           ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2011-05-12 18:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pekka Enberg, James Bottomley, Dave Jones, Mel Gorman,
	Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, 12 May 2011, Andrea Arcangeli wrote:

> even order 3 is causing troubles (which doesn't immediately make lumpy
> activated, it only activates when priority is < DEF_PRIORITY-2, so
> after 2 loops failing to reclaim nr_to_reclaim pages), imagine what

That is a significant change for SLUB with the merge of the compaction
code.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
  2011-05-11 22:34     ` James Bottomley
  2011-05-12 11:13       ` Pekka Enberg
@ 2011-05-12 18:04       ` Andrea Arcangeli
  2011-05-13 11:24         ` Mel Gorman
  1 sibling, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-12 18:04 UTC (permalink / raw)
  To: James Bottomley
  Cc: David Rientjes, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

Hi James!

On Wed, May 11, 2011 at 05:34:27PM -0500, James Bottomley wrote:
> Yes, but only once in all the testing.  With patches 1 and 2 the hang is

Weird, patch 2 makes the large-order allocation without __GFP_WAIT, so
even COMPACTION=y/n shouldn't matter anymore. Am I misreading
something, Mel?

Removing ~__GFP_WAIT from patch 2 (and adding ~__GFP_REPEAT as a
correctness improvement) and setting COMPACTION=y also should work ok.

Removing ~__GFP_WAIT from patch 2 and setting COMPACTION=n is expected
not to work well.

But compaction should only make the difference if you remove
~__GFP_WAIT from patch 2.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 18:03                         ` Christoph Lameter
@ 2011-05-12 18:09                           ` Andrea Arcangeli
  2011-05-12 18:16                             ` Christoph Lameter
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-12 18:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, James Bottomley, Dave Jones, Mel Gorman,
	Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 01:03:05PM -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, Andrea Arcangeli wrote:
> 
> > even order 3 is causing troubles (which doesn't immediately make lumpy
> > activated, it only activates when priority is < DEF_PRIORITY-2, so
> > after 2 loops failing to reclaim nr_to_reclaim pages), imagine what
> 
> That is a significant change for SLUB with the merge of the compaction
> code.

Even before compaction was posted, I had to shut off lumpy reclaim or
it'd hang all the time with frequent order 9 allocations. Maybe lumpy
was better before, maybe lumpy "improved" its reliability recently,
but it definitely wasn't performing well. That definitely applies to
>=2.6.32 (I had to nuke lumpy from it and keep only compaction
enabled, pretty much like upstream with COMPACTION=y). I don't think I
ever tried lumpy code earlier than 2.6.32; maybe it was less
aggressive back then, I don't exclude it, but I thought the whole
notion of lumpy was to take down everything in its way, which usually
leads to processes hanging in swapins or pageins for frequently used
memory.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 18:09                           ` Andrea Arcangeli
@ 2011-05-12 18:16                             ` Christoph Lameter
  0 siblings, 0 replies; 77+ messages in thread
From: Christoph Lameter @ 2011-05-12 18:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Pekka Enberg, James Bottomley, Dave Jones, Mel Gorman,
	Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, 12 May 2011, Andrea Arcangeli wrote:

> On Thu, May 12, 2011 at 01:03:05PM -0500, Christoph Lameter wrote:
> > On Thu, 12 May 2011, Andrea Arcangeli wrote:
> >
> > > even order 3 is causing troubles (which doesn't immediately make lumpy
> > > activated, it only activates when priority is < DEF_PRIORITY-2, so
> > > after 2 loops failing to reclaim nr_to_reclaim pages), imagine what
> >
> > That is a significant change for SLUB with the merge of the compaction
> > code.
>
> Even before compaction was posted, I had to shut off lumpy reclaim or
> it'd hang all the time with frequent order 9 allocations. Maybe lumpy
> was better before, maybe lumpy "improved" its reliability recently,

Well, we are concerned about order-2 and order-3 allocations here.
Checking for order < PAGE_ALLOC_COSTLY_ORDER, so that only the order-9
case can enter lumpy reclaim, looks okay.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 18:00                       ` Christoph Lameter
@ 2011-05-12 18:18                         ` Andrea Arcangeli
  0 siblings, 0 replies; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-12 18:18 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: James Bottomley, Dave Jones, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 01:00:10PM -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, Andrea Arcangeli wrote:
> 
> > order 1 should work better, because it's less likely we end up here
> > (which leaves RECLAIM_MODE_LUMPYRECLAIM on and then see what happens
> > at the top of page_check_references())
> >
> >    else if (sc->order && priority < DEF_PRIORITY - 2)
> 
> Why is this DEF_PRIORITY - 2? Shouldn't it be DEF_PRIORITY? An accommodation
> for SLAB order-1 allocs?

That's to allow a few loops of the shrinker first (i.e. not take down
everything in the way, regardless of any aging information in the
pte/page, when there's no memory pressure). This "- 2" is independent
of the allocation order. If it were < DEF_PRIORITY it'd trigger lumpy
already on the second loop (in do_try_to_free_pages), so it'd make
things worse, just like decreasing the PAGE_ALLOC_COSTLY_ORDER define
to 2 while keeping slub at 3 would make things worse.

> May I assume that the case of order 2 and 3 allocs in that case was not
> very well tested after the changes to introduce compaction since people
> were focusing on RHEL testing?

Not really; I had to eliminate lumpy before compaction was even
developed. RHEL6 has zero lumpy code (not even at compile time) and
compaction enabled by default, so even if we enabled SLUB=y it should
work OK (not sure why James still hits the hang with patch 2 applied,
which clears __GFP_WAIT; that failure likely has nothing to do with
compaction or lumpy, as both are off when __GFP_WAIT is not set).

Lumpy is also eliminated upstream now (but only at runtime, when
COMPACTION=y), unless __GFP_REPEAT is set, in which case I think lumpy
still runs upstream too; only a few infrequent things like increasing
nr_hugepages use that.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 17:11                     ` Pekka Enberg
  2011-05-12 17:38                       ` Christoph Lameter
  2011-05-12 17:51                       ` Andrea Arcangeli
@ 2011-05-12 18:36                       ` James Bottomley
  2 siblings, 0 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-12 18:36 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Dave Jones, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4, Andrea Arcangeli

On Thu, 2011-05-12 at 20:11 +0300, Pekka Enberg wrote:
> On Thu, May 12, 2011 at 8:06 PM, Pekka Enberg <penberg@kernel.org> wrote:
> > On Thu, May 12, 2011 at 7:30 PM, James Bottomley
> > <James.Bottomley@hansenpartnership.com> wrote:
> >> So suggest an alternative root cause and a test to expose it.
> >
> > Is your .config available somewhere, btw?
> 
> If it's this:
> 
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=blob_plain;f=config-x86_64-generic;hb=HEAD
> 
> I'd love to see what happens if you disable
> 
> CONFIG_TRANSPARENT_HUGEPAGE=y
> 
> because that's going to reduce high order allocations as well, no?

So yes, it's a default FC15 config.

Disabling THP was initially tried a long time ago and didn't make a
difference (it was originally suggested by Chris Mason).

James



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 15:55           ` Pekka Enberg
@ 2011-05-12 18:37             ` James Bottomley
  2011-05-12 18:46               ` Christoph Lameter
  2011-05-12 19:44               ` James Bottomley
  0 siblings, 2 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-12 18:37 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > However, since you admit even you see problems, let's concentrate on
> > fixing them rather than recriminations?
> 
> Yes, please. So does dropping max_order to 1 help?
> PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.

Just booting with max_slab_order=1 (and none of the other patches
applied) I can still get the machine to go into kswapd at 99%, so it
doesn't seem to make much of a difference.

Do you want me to try with the other two patches and max_slab_order=1?

James



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 18:37             ` James Bottomley
@ 2011-05-12 18:46               ` Christoph Lameter
  2011-05-12 19:21                 ` James Bottomley
  2011-05-12 19:44               ` James Bottomley
  1 sibling, 1 reply; 77+ messages in thread
From: Christoph Lameter @ 2011-05-12 18:46 UTC (permalink / raw)
  To: James Bottomley
  Cc: Pekka Enberg, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 12 May 2011, James Bottomley wrote:

> On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > However, since you admit even you see problems, let's concentrate on
> > > fixing them rather than recriminations?
> >
> > Yes, please. So does dropping max_order to 1 help?
> > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
>
> Just booting with max_slab_order=1 (and none of the other patches
> applied) I can still get the machine to go into kswapd at 99%, so it
> doesn't seem to make much of a difference.

slub_max_order=1 right? Not max_slab_order.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 18:46               ` Christoph Lameter
@ 2011-05-12 19:21                 ` James Bottomley
  0 siblings, 0 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-12 19:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 2011-05-12 at 13:46 -0500, Christoph Lameter wrote:
> On Thu, 12 May 2011, James Bottomley wrote:
> 
> > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > > However, since you admit even you see problems, let's concentrate on
> > > > fixing them rather than recriminations?
> > >
> > > Yes, please. So does dropping max_order to 1 help?
> > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> >
> > Just booting with max_slab_order=1 (and none of the other patches
> > applied) I can still get the machine to go into kswapd at 99%, so it
> > doesn't seem to make much of a difference.
> 
> slub_max_order=1 right? Not max_slab_order.

Yes.

James



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 18:37             ` James Bottomley
  2011-05-12 18:46               ` Christoph Lameter
@ 2011-05-12 19:44               ` James Bottomley
  2011-05-12 20:04                 ` James Bottomley
  1 sibling, 1 reply; 77+ messages in thread
From: James Bottomley @ 2011-05-12 19:44 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > However, since you admit even you see problems, let's concentrate on
> > > fixing them rather than recriminations?
> > 
> > Yes, please. So does dropping max_order to 1 help?
> > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> 
> Just booting with max_slab_order=1 (and none of the other patches
> applied) I can still get the machine to go into kswapd at 99%, so it
> doesn't seem to make much of a difference.
> 
> Do you want me to try with the other two patches and max_slab_order=1?

OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
trigger the problem (kswapd spinning at 99%).  This is still with
PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
it.

James



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 19:44               ` James Bottomley
@ 2011-05-12 20:04                 ` James Bottomley
  2011-05-12 20:29                   ` Johannes Weiner
                                     ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-12 20:04 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > > However, since you admit even you see problems, let's concentrate on
> > > > fixing them rather than recriminations?
> > > 
> > > Yes, please. So does dropping max_order to 1 help?
> > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> > 
> > Just booting with max_slab_order=1 (and none of the other patches
> > applied) I can still get the machine to go into kswapd at 99%, so it
> > doesn't seem to make much of a difference.
> > 
> > Do you want me to try with the other two patches and max_slab_order=1?
> 
> OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
> trigger the problem (kswapd spinning at 99%).  This is still with
> PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
> patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
> it.

Confirmed, I'm afraid ... I can trigger the problem with all three
patches under PREEMPT.  It's not a hang this time, it's just kswapd
taking 100% system time on 1 CPU and it won't calm down after I unload
the system.

James



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 20:04                 ` James Bottomley
@ 2011-05-12 20:29                   ` Johannes Weiner
  2011-05-12 20:31                     ` Johannes Weiner
  2011-05-12 20:31                     ` James Bottomley
  2011-05-12 22:04                   ` James Bottomley
  2011-05-13  6:16                   ` Pekka Enberg
  2 siblings, 2 replies; 77+ messages in thread
From: Johannes Weiner @ 2011-05-12 20:29 UTC (permalink / raw)
  To: James Bottomley
  Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > > > However, since you admit even you see problems, let's concentrate on
> > > > > fixing them rather than recriminations?
> > > > 
> > > > Yes, please. So does dropping max_order to 1 help?
> > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> > > 
> > > Just booting with max_slab_order=1 (and none of the other patches
> > > applied) I can still get the machine to go into kswapd at 99%, so it
> > > doesn't seem to make much of a difference.
> > > 
> > > Do you want me to try with the other two patches and max_slab_order=1?
> > 
> > OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
> > trigger the problem (kswapd spinning at 99%).  This is still with
> > PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
> > patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
> > it.
> 
> Confirmed, I'm afraid ... I can trigger the problem with all three
> patches under PREEMPT.  It's not a hang this time, it's just kswapd
> taking 100% system time on 1 CPU and it won't calm down after I unload
> the system.

That is kind of expected, though.  If one CPU is busy with a streaming
IO load generating new pages, kswapd is busy reclaiming the old ones
so that the generator does not have to do the reclaim itself.

By unload, do you mean stopping the generator?  And if so, how quickly
after you stop the generator does kswapd go back to sleep?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 20:29                   ` Johannes Weiner
@ 2011-05-12 20:31                     ` Johannes Weiner
  2011-05-12 20:31                     ` James Bottomley
  1 sibling, 0 replies; 77+ messages in thread
From: Johannes Weiner @ 2011-05-12 20:31 UTC (permalink / raw)
  To: James Bottomley
  Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 10:29:17PM +0200, Johannes Weiner wrote:
> On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote:
> > > On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> > > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > > > > However, since you admit even you see problems, let's concentrate on
> > > > > > fixing them rather than recriminations?
> > > > > 
> > > > > Yes, please. So does dropping max_order to 1 help?
> > > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> > > > 
> > > > Just booting with max_slab_order=1 (and none of the other patches
> > > > applied) I can still get the machine to go into kswapd at 99%, so it
> > > > doesn't seem to make much of a difference.
> > > > 
> > > > Do you want me to try with the other two patches and max_slab_order=1?
> > > 
> > > OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
> > > trigger the problem (kswapd spinning at 99%).  This is still with
> > > PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
> > > patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
> > > it.
> > 
> > Confirmed, I'm afraid ... I can trigger the problem with all three
> > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > taking 100% system time on 1 CPU and it won't calm down after I unload
> > the system.

I am so sorry, I missed the "won't" here.  Please ignore.

> That is kind of expected, though.  If one CPU is busy with a streaming
> IO load generating new pages, kswapd is busy reclaiming the old ones
> so that the generator does not have to do the reclaim itself.
> 
> By unload, do you mean stopping the generator?  And if so, how quickly
> after you stop the generator does kswapd go back to sleep?

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 20:29                   ` Johannes Weiner
  2011-05-12 20:31                     ` Johannes Weiner
@ 2011-05-12 20:31                     ` James Bottomley
  1 sibling, 0 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-12 20:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Thu, 2011-05-12 at 22:29 +0200, Johannes Weiner wrote:
> On Thu, May 12, 2011 at 03:04:12PM -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 14:44 -0500, James Bottomley wrote:
> > > On Thu, 2011-05-12 at 13:37 -0500, James Bottomley wrote:
> > > > On Thu, 2011-05-12 at 18:55 +0300, Pekka Enberg wrote:
> > > > > On Thu, 2011-05-12 at 10:43 -0500, James Bottomley wrote:
> > > > > > However, since you admit even you see problems, let's concentrate on
> > > > > > fixing them rather than recriminations?
> > > > > 
> > > > > Yes, please. So does dropping max_order to 1 help?
> > > > > PAGE_ALLOC_COSTLY_ORDER is set to 3 in 2.6.39-rc7.
> > > > 
> > > > Just booting with max_slab_order=1 (and none of the other patches
> > > > applied) I can still get the machine to go into kswapd at 99%, so it
> > > > doesn't seem to make much of a difference.
> > > > 
> > > > Do you want me to try with the other two patches and max_slab_order=1?
> > > 
> > > OK, so patches 1 + 2 plus setting slub_max_order=1 still manages to
> > > trigger the problem (kswapd spinning at 99%).  This is still with
> > > PREEMPT; it's possible that non-PREEMPT might be better, so I'll try
> > > patches 1+2+3 with PREEMPT just to see if the perturbation is caused by
> > > it.
> > 
> > Confirmed, I'm afraid ... I can trigger the problem with all three
> > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > taking 100% system time on 1 CPU and it won't calm down after I unload
> > the system.
> 
> That is kind of expected, though.  If one CPU is busy with a streaming
> IO load generating new pages, kswapd is busy reclaiming the old ones
> so that the generator does not have to do the reclaim itself.
> 
> By unload, do you mean stopping the generator? 

Correct.

>  And if so, how quickly
> after you stop the generator does kswapd go back to sleep?

It doesn't.  At least not on its own; the CPU stays pegged.  If I start
other work (like a kernel compile), then sometimes it does go back to
nothing.

I'm speculating that this is the hang case for non-PREEMPT.

James



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 20:04                 ` James Bottomley
  2011-05-12 20:29                   ` Johannes Weiner
@ 2011-05-12 22:04                   ` James Bottomley
  2011-05-12 22:15                     ` Johannes Weiner
  2011-05-13  6:16                   ` Pekka Enberg
  2 siblings, 1 reply; 77+ messages in thread
From: James Bottomley @ 2011-05-12 22:04 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote:
> Confirmed, I'm afraid ... I can trigger the problem with all three
> patches under PREEMPT.  It's not a hang this time, it's just kswapd
> taking 100% system time on 1 CPU and it won't calm down after I unload
> the system.

Just on a "if you don't know what's wrong poke about and see" basis, I
sliced out all the complex logic in sleeping_prematurely() and, as far
as I can tell, it cures the problem behaviour.  I've loaded up the
system, and taken the tar load generator through three runs without
producing a spinning kswapd (this is PREEMPT).  I'll try with a
non-PREEMPT kernel shortly.

What this seems to say is that there's a problem with the complex logic
in sleeping_prematurely().  I'm pretty sure hacking up
sleeping_prematurely() just to dump all the calculations is the wrong
thing to do, but perhaps someone can see what the right thing is ...

By the way, I stripped off all the patches, so this is a plain old
2.6.38.6 kernel with the default FC15 config.

James

---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0665520..1bdea7d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2255,6 +2255,8 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	if (remaining)
 		return true;
 
+	return false;
+
 	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;



^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 22:04                   ` James Bottomley
@ 2011-05-12 22:15                     ` Johannes Weiner
  2011-05-12 22:58                       ` Minchan Kim
                                         ` (2 more replies)
  0 siblings, 3 replies; 77+ messages in thread
From: Johannes Weiner @ 2011-05-12 22:15 UTC (permalink / raw)
  To: James Bottomley
  Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote:
> > Confirmed, I'm afraid ... I can trigger the problem with all three
> > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > taking 100% system time on 1 CPU and it won't calm down after I unload
> > the system.
> 
> Just on a "if you don't know what's wrong poke about and see" basis, I
> sliced out all the complex logic in sleeping_prematurely() and, as far
> as I can tell, it cures the problem behaviour.  I've loaded up the
> system, and taken the tar load generator through three runs without
> producing a spinning kswapd (this is PREEMPT).  I'll try with a
> non-PREEMPT kernel shortly.
> 
> What this seems to say is that there's a problem with the complex logic
> in sleeping_prematurely().  I'm pretty sure hacking up
> sleeping_prematurely() just to dump all the calculations is the wrong
> thing to do, but perhaps someone can see what the right thing is ...

I think I see the problem: the boolean logic of sleeping_prematurely()
is odd.  If it returns true, kswapd will keep running.  So if
pgdat_balanced() returns true, kswapd should go to sleep.

This?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b701e0..092d773 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2261,7 +2261,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		return !pgdat_balanced(pgdat, balanced, classzone_idx);
 	else
 		return !all_zones_ok;
 }

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 22:15                     ` Johannes Weiner
@ 2011-05-12 22:58                       ` Minchan Kim
  2011-05-13  5:39                         ` Minchan Kim
  2011-05-13  0:47                       ` James Bottomley
  2011-05-13 10:30                       ` Mel Gorman
  2 siblings, 1 reply; 77+ messages in thread
From: Minchan Kim @ 2011-05-12 22:58 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: James Bottomley, Pekka Enberg, Christoph Lameter, Mel Gorman,
	Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Fri, May 13, 2011 at 7:15 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote:
>> On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote:
>> > Confirmed, I'm afraid ... I can trigger the problem with all three
>> > patches under PREEMPT.  It's not a hang this time, it's just kswapd
>> > taking 100% system time on 1 CPU and it won't calm down after I unload
>> > the system.
>>
>> Just on a "if you don't know what's wrong poke about and see" basis, I
>> sliced out all the complex logic in sleeping_prematurely() and, as far
>> as I can tell, it cures the problem behaviour.  I've loaded up the
>> system, and taken the tar load generator through three runs without
>> producing a spinning kswapd (this is PREEMPT).  I'll try with a
>> non-PREEMPT kernel shortly.
>>
>> What this seems to say is that there's a problem with the complex logic
>> in sleeping_prematurely().  I'm pretty sure hacking up
>> sleeping_prematurely() just to dump all the calculations is the wrong
>> thing to do, but perhaps someone can see what the right thing is ...
>
> I think I see the problem: the boolean logic of sleeping_prematurely()
> is odd.  If it returns true, kswapd will keep running.  So if
> pgdat_balanced() returns true, kswapd should go to sleep.
>
> This?

Yes. Good catch.

>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b701e0..092d773 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2261,7 +2261,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>         * must be balanced
>         */
>        if (order)
> -               return pgdat_balanced(pgdat, balanced, classzone_idx);
> +               return !pgdat_balanced(pgdat, balanced, classzone_idx);
>        else
>                return !all_zones_ok;
>  }
>



-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 22:15                     ` Johannes Weiner
  2011-05-12 22:58                       ` Minchan Kim
@ 2011-05-13  0:47                       ` James Bottomley
  2011-05-13  4:12                         ` James Bottomley
  2011-05-13 10:55                         ` Mel Gorman
  2011-05-13 10:30                       ` Mel Gorman
  2 siblings, 2 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-13  0:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Fri, 2011-05-13 at 00:15 +0200, Johannes Weiner wrote:
> On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote:
> > > Confirmed, I'm afraid ... I can trigger the problem with all three
> > > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > > taking 100% system time on 1 CPU and it won't calm down after I unload
> > > the system.
> > 
> > Just on a "if you don't know what's wrong poke about and see" basis, I
> > sliced out all the complex logic in sleeping_prematurely() and, as far
> > as I can tell, it cures the problem behaviour.  I've loaded up the
> > system, and taken the tar load generator through three runs without
> > producing a spinning kswapd (this is PREEMPT).  I'll try with a
> > non-PREEMPT kernel shortly.
> > 
> > What this seems to say is that there's a problem with the complex logic
> > in sleeping_prematurely().  I'm pretty sure hacking up
> > sleeping_prematurely() just to dump all the calculations is the wrong
> > thing to do, but perhaps someone can see what the right thing is ...
> 
> I think I see the problem: the boolean logic of sleeping_prematurely()
> is odd.  If it returns true, kswapd will keep running.  So if
> pgdat_balanced() returns true, kswapd should go to sleep.
> 
> This?

I was going to say this was a winner, but on the third untar run on
non-PREEMPT, I hit the kswapd livelock.  It's got much farther than
previous attempts, which all hang on the first run, but I think the
essential problem is still (at least on this machine) that
sleeping_prematurely() is doing too much work for the wakeup storm that
allocators are causing.

Something that ratelimits the amount of time we spend in the watermark
calculations, like the below (which incorporates your pgdat fix) seems
to be much more stable (I've not run it for three full runs yet, but
kswapd CPU time is way lower so far).

The heuristic here is that if we're making the calculation more than ten
times in 1/10 of a second, stop and sleep anyway.

James

---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0665520..545250c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2249,12 +2249,32 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 {
 	int i;
 	unsigned long balanced = 0;
-	bool all_zones_ok = true;
+	bool all_zones_ok = true, ret;
+	static int returned_true = 0;
+	static unsigned long prev_jiffies = 0;
+	
 
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;
 
+	/* rate limit our entry to the watermark calculations */
+	if (time_after(prev_jiffies + HZ/10, jiffies)) {
+		/* previously returned false, do so again */
+		if (returned_true == 0)
+			return false;
+		/* or we've done the true calculation too many times */
+		if (returned_true++ > 10)
+			return false;
+
+		return true;
+	} else {
+		/* haven't been here for a while, reset the true count */
+		returned_true = 0;
+	}
+
+	prev_jiffies = jiffies;
+
 	/* Check the watermark levels */
 	for (i = 0; i < pgdat->nr_zones; i++) {
 		struct zone *zone = pgdat->node_zones + i;
@@ -2286,9 +2306,16 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	 * must be balanced
 	 */
 	if (order)
-		return pgdat_balanced(pgdat, balanced, classzone_idx);
+		ret = !pgdat_balanced(pgdat, balanced, classzone_idx);
+	else
+		ret = !all_zones_ok;
+
+	if (ret)
+		returned_true++;
 	else
-		return !all_zones_ok;
+		returned_true = 0;
+
+	return ret;
 }
 
 /*

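[Editor's note: the wrap-safe tick comparison and the "cap the passes per window" cutoff above can be sketched in userspace as below. The names (`ratelimit_allow`, `struct ratelimit`) and the simplified reset logic are illustrative only; the real `time_after()` macro lives in include/linux/jiffies.h and the patch above keys off the previous verdict rather than a plain counter.]

```c
#include <assert.h>

/* Userspace reimplementation of the kernel's wrap-safe time_after():
 * true when tick a is after tick b, even across counter wraparound,
 * because the subtraction is compared as a signed value. */
#define time_after(a, b)  ((long)((b) - (a)) < 0)

struct ratelimit {
	unsigned long prev;	/* tick that opened the current window */
	int returned_true;	/* expensive passes taken in this window */
};

/* Returns 1 if the expensive watermark calculation may run again,
 * 0 if the caller should give up and sleep anyway. */
static int ratelimit_allow(struct ratelimit *rl, unsigned long now,
			   unsigned long window, int max_true)
{
	if (time_after(rl->prev + window, now)) {
		/* still inside the window: cap the number of passes */
		return rl->returned_true++ < max_true;
	}
	/* haven't been here for a while: open a new window */
	rl->prev = now;
	rl->returned_true = 0;
	return 1;
}
```

With `window = HZ/10` and `max_true = 10` this models the heuristic James describes: a wakeup storm gets at most ten expensive passes per tenth of a second before the check short-circuits.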


^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-13  0:47                       ` James Bottomley
@ 2011-05-13  4:12                         ` James Bottomley
  2011-05-13 10:55                         ` Mel Gorman
  1 sibling, 0 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-13  4:12 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Pekka Enberg, Christoph Lameter, Mel Gorman, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Thu, 2011-05-12 at 19:47 -0500, James Bottomley wrote:
> On Fri, 2011-05-13 at 00:15 +0200, Johannes Weiner wrote:
> > On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote:
> > > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote:
> > > > Confirmed, I'm afraid ... I can trigger the problem with all three
> > > > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > > > taking 100% system time on 1 CPU and it won't calm down after I unload
> > > > the system.
> > > 
> > > Just on a "if you don't know what's wrong poke about and see" basis, I
> > > sliced out all the complex logic in sleeping_prematurely() and, as far
> > > as I can tell, it cures the problem behaviour.  I've loaded up the
> > > system, and taken the tar load generator through three runs without
> > > producing a spinning kswapd (this is PREEMPT).  I'll try with a
> > > non-PREEMPT kernel shortly.
> > > 
> > > What this seems to say is that there's a problem with the complex logic
> > > in sleeping_prematurely().  I'm pretty sure hacking up
> > > sleeping_prematurely() just to dump all the calculations is the wrong
> > > thing to do, but perhaps someone can see what the right thing is ...
> > 
> > I think I see the problem: the boolean logic of sleeping_prematurely()
> > is odd.  If it returns true, kswapd will keep running.  So if
> > pgdat_balanced() returns true, kswapd should go to sleep.
> > 
> > This?
> 
> I was going to say this was a winner, but on the third untar run on
> non-PREEMPT, I hit the kswapd livelock.  It's got much farther than
> previous attempts, which all hang on the first run, but I think the
> essential problem is still (at least on this machine) that
> sleeping_prematurely() is doing too much work for the wakeup storm that
> allocators are causing.
> 
> Something that ratelimits the amount of time we spend in the watermark
> calculations, like the below (which incorporates your pgdat fix) seems
> to be much more stable (I've not run it for three full runs yet, but
> kswapd CPU time is way lower so far).

I've hammered it for several hours now with multiple loads; I can't seem
to break it (famous last words, of course).

James



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 22:58                       ` Minchan Kim
@ 2011-05-13  5:39                         ` Minchan Kim
  0 siblings, 0 replies; 77+ messages in thread
From: Minchan Kim @ 2011-05-13  5:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: James Bottomley, Pekka Enberg, Christoph Lameter, Mel Gorman,
	Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Rik van Riel, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Fri, May 13, 2011 at 7:58 AM, Minchan Kim <minchan.kim@gmail.com> wrote:
> On Fri, May 13, 2011 at 7:15 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote:
>>> On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote:
>>> > Confirmed, I'm afraid ... I can trigger the problem with all three
>>> > patches under PREEMPT.  It's not a hang this time, it's just kswapd
>>> > taking 100% system time on 1 CPU and it won't calm down after I unload
>>> > the system.
>>>
>>> Just on a "if you don't know what's wrong poke about and see" basis, I
>>> sliced out all the complex logic in sleeping_prematurely() and, as far
>>> as I can tell, it cures the problem behaviour.  I've loaded up the
>>> system, and taken the tar load generator through three runs without
>>> producing a spinning kswapd (this is PREEMPT).  I'll try with a
>>> non-PREEMPT kernel shortly.
>>>
>>> What this seems to say is that there's a problem with the complex logic
>>> in sleeping_prematurely().  I'm pretty sure hacking up
>>> sleeping_prematurely() just to dump all the calculations is the wrong
>>> thing to do, but perhaps someone can see what the right thing is ...
>>
>> I think I see the problem: the boolean logic of sleeping_prematurely()
>> is odd.  If it returns true, kswapd will keep running.  So if
>> pgdat_balanced() returns true, kswapd should go to sleep.
>>
>> This?
>
> Yes. Good catch.

In addition, I see something strange.
The comment above pgdat_balanced() says:
"Only zones that meet watermarks and are in a zone allowed by the
callers classzone_idx are added to balanced_pages"

That holds for the balance_pgdat() caller, but not in sleeping_prematurely().
This?

barrios@barrios-desktop:~/linux-mmotm$ git diff mm/vmscan.c
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 292582c..d9078cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2322,7 +2322,8 @@ static bool sleeping_prematurely(pg_data_t
*pgdat, int order, long remaining,
                                                        classzone_idx, 0))
                        all_zones_ok = false;
                else
-                       balanced += zone->present_pages;
+                       if (i <= classzone_idx)
+                               balanced += zone->present_pages;
        }

        /*
@@ -2331,7 +2332,7 @@ static bool sleeping_prematurely(pg_data_t
*pgdat, int order, long remaining,
         * must be balanced
         */
        if (order)
-               return pgdat_balanced(pgdat, balanced, classzone_idx);
+               return !pgdat_balanced(pgdat, balanced, classzone_idx);
        else
                return !all_zones_ok;
 }
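[Editor's note: the return-value contract being fixed here can be sketched in userspace as below. The struct, the 25% threshold, and all names are illustrative stand-ins, not the kernel's exact code: `sleeping_prematurely()` returning true means "kswapd was woken too early, keep reclaiming", so a balanced node must make it return false, and only zones at or below the caller's classzone_idx may count as balanced.]

```c
#include <assert.h>

struct zone_info {
	unsigned long present_pages;
	int meets_watermark;	/* zone is above its high watermark */
};

/* Count pages only in zones allowed by the caller's classzone_idx,
 * mirroring the `i <= classzone_idx` guard in the hunk above. */
static unsigned long balanced_pages(const struct zone_info *zones,
				    int nr_zones, int classzone_idx)
{
	unsigned long balanced = 0;
	int i;

	for (i = 0; i < nr_zones; i++)
		if (zones[i].meets_watermark && i <= classzone_idx)
			balanced += zones[i].present_pages;
	return balanced;
}

/* Illustrative 25% threshold: enough of the node is balanced. */
static int pgdat_balanced(unsigned long balanced, unsigned long present)
{
	return balanced >= present / 4;
}

/* For order > 0: the wakeup was premature only if NOT balanced,
 * hence the `!` added by the one-line fix above. */
static int sleeping_prematurely(unsigned long balanced, unsigned long present)
{
	return !pgdat_balanced(balanced, present);
}
```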





-- 
Kind regards,
Minchan Kim

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 20:04                 ` James Bottomley
  2011-05-12 20:29                   ` Johannes Weiner
  2011-05-12 22:04                   ` James Bottomley
@ 2011-05-13  6:16                   ` Pekka Enberg
  2011-05-13 10:05                     ` Mel Gorman
  2 siblings, 1 reply; 77+ messages in thread
From: Pekka Enberg @ 2011-05-13  6:16 UTC (permalink / raw)
  To: James Bottomley
  Cc: Christoph Lameter, Mel Gorman, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

Hi,

On Thu, May 12, 2011 at 11:04 PM, James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
> Confirmed, I'm afraid ... I can trigger the problem with all three
> patches under PREEMPT.  It's not a hang this time, it's just kswapd
> taking 100% system time on 1 CPU and it won't calm down after I unload
> the system.

OK, that's good to know. I'd still like to take patches 1-2, though. Mel?

                        Pekka

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 18:00                         ` Andrea Arcangeli
@ 2011-05-13  9:49                           ` Mel Gorman
  2011-05-15 16:39                             ` Andrea Arcangeli
  0 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2011-05-13  9:49 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Pekka Enberg, James Bottomley, Dave Jones,
	Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote:
> <SNIP>
>
> BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too
> (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not
> with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless
> lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.

This is in V2 (unreleased, testing in progress and was running
overnight). I noticed that clearing __GFP_REPEAT is required for
reclaim/compaction if direct reclaimers from SLUB are to return false in
should_continue_reclaim() and bail out from high-order allocation
properly. As it is, there is a possibility for slub high-order direct
reclaimers to loop in reclaim/compaction for a long time. This is
only important when CONFIG_COMPACTION=y.
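[Editor's note: the flag-stripping idea described here can be sketched as below. The bit values and the `MY_` names are made-up stand-ins (the real flags are in include/linux/gfp.h); the sketch only shows the technique of masking out the "try hard" flags for the speculative high-order attempt so reclaim/compaction bail out quickly, before falling back to order-0 with the caller's original flags.]

```c
#include <assert.h>

/* Illustrative bit values, NOT the kernel's gfp.h definitions. */
#define MY_GFP_WAIT	0x01u
#define MY_GFP_REPEAT	0x02u
#define MY_GFP_NOFAIL	0x04u
#define MY_GFP_KERNEL	0x10u	/* stands in for the caller's base flags */

/* Flags to use for the speculative high-order slab allocation:
 * strip waiting, retrying and must-not-fail semantics so the
 * attempt fails fast instead of looping in reclaim/compaction. */
static unsigned int speculative_gfp(unsigned int gfp)
{
	return gfp & ~(MY_GFP_WAIT | MY_GFP_REPEAT | MY_GFP_NOFAIL);
}
```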

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-13  6:16                   ` Pekka Enberg
@ 2011-05-13 10:05                     ` Mel Gorman
  0 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-13 10:05 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: James Bottomley, Christoph Lameter, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Rik van Riel,
	Johannes Weiner, linux-fsdevel, linux-mm, linux-kernel,
	linux-ext4

On Fri, May 13, 2011 at 09:16:24AM +0300, Pekka Enberg wrote:
> Hi,
> 
> On Thu, May 12, 2011 at 11:04 PM, James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > Confirmed, I'm afraid ... I can trigger the problem with all three
> > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > taking 100% system time on 1 CPU and it won't calm down after I unload
> > the system.
> 
> OK, that's good to know. I'd still like to take patches 1-2, though. Mel?
> 

Wait for a V2 please. __GFP_REPEAT should also be removed.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-11 22:27       ` David Rientjes
@ 2011-05-13 10:14         ` Mel Gorman
  0 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-13 10:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, James Bottomley, Colin King, Raghavendra D Prabhu,
	Jan Kara, Chris Mason, Christoph Lameter, Pekka Enberg,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Wed, May 11, 2011 at 03:27:11PM -0700, David Rientjes wrote:
> On Wed, 11 May 2011, Mel Gorman wrote:
> 
> > I agree with you that there are situations where plenty of memory
> > means that that it'll perform much better. However, indications are
> > that it breaks down with high CPU usage when memory is low.  Worse,
> > once fragmentation becomes a problem, large amounts of UNMOVABLE and
> > RECLAIMABLE will make it progressively more expensive to find the
> > necessary pages. Perhaps with patches 1 and 2, this is not as much
> > of a problem but figures in the leader indicated that for a simple
> > workload with large amounts of files and data exceeding physical
> > memory that it was better off not to use high orders at all which
> > is a situation I'd expect to be encountered by more users than
> > performance-sensitive applications.
> > 
> > In other words, we're taking one hit or the other.
> > 
> 
> Seems like the ideal solution would then be to find how to best set the 
> default, and that can probably only be done with the size of the smallest 
> node since it has a higher liklihood of encountering a large amount of 
> unreclaimable slab when memory is low.
> 

Ideally yes, but glancing through this thread and thinking on it a bit
more, I'm going to drop this patch. As pointed out, SLUB with high
orders has been in use with distributions already so the breakage is
elsewhere. Patches 1 and 2 still make some sense but they're not the
root cause.

> <SNIP>

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 22:15                     ` Johannes Weiner
  2011-05-12 22:58                       ` Minchan Kim
  2011-05-13  0:47                       ` James Bottomley
@ 2011-05-13 10:30                       ` Mel Gorman
  2 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-13 10:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: James Bottomley, Pekka Enberg, Christoph Lameter, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Fri, May 13, 2011 at 12:15:06AM +0200, Johannes Weiner wrote:
> On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote:
> > > Confirmed, I'm afraid ... I can trigger the problem with all three
> > > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > > taking 100% system time on 1 CPU and it won't calm down after I unload
> > > the system.
> > 
> > Just on a "if you don't know what's wrong poke about and see" basis, I
> > sliced out all the complex logic in sleeping_prematurely() and, as far
> > as I can tell, it cures the problem behaviour.  I've loaded up the
> > system, and taken the tar load generator through three runs without
> > producing a spinning kswapd (this is PREEMPT).  I'll try with a
> > non-PREEMPT kernel shortly.
> > 
> > What this seems to say is that there's a problem with the complex logic
> > in sleeping_prematurely().  I'm pretty sure hacking up
> > sleeping_prematurely() just to dump all the calculations is the wrong
> > thing to do, but perhaps someone can see what the right thing is ...
> 
> I think I see the problem: the boolean logic of sleeping_prematurely()
> is odd.  If it returns true, kswapd will keep running.  So if
> pgdat_balanced() returns true, kswapd should go to sleep.
> 
> This?
> 

You're right.

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b701e0..092d773 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2261,7 +2261,7 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  	 * must be balanced
>  	 */
>  	if (order)
> -		return pgdat_balanced(pgdat, balanced, classzone_idx);
> +		return !pgdat_balanced(pgdat, balanced, classzone_idx);
>  	else
>  		return !all_zones_ok;
>  }

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-13  0:47                       ` James Bottomley
  2011-05-13  4:12                         ` James Bottomley
@ 2011-05-13 10:55                         ` Mel Gorman
  2011-05-13 14:16                           ` James Bottomley
  1 sibling, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2011-05-13 10:55 UTC (permalink / raw)
  To: James Bottomley
  Cc: Johannes Weiner, Pekka Enberg, Christoph Lameter, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 07:47:05PM -0500, James Bottomley wrote:
> On Fri, 2011-05-13 at 00:15 +0200, Johannes Weiner wrote:
> > On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote:
> > > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote:
> > > > Confirmed, I'm afraid ... I can trigger the problem with all three
> > > > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > > > taking 100% system time on 1 CPU and it won't calm down after I unload
> > > > the system.
> > > 
> > > Just on a "if you don't know what's wrong poke about and see" basis, I
> > > sliced out all the complex logic in sleeping_prematurely() and, as far
> > > as I can tell, it cures the problem behaviour.  I've loaded up the
> > > system, and taken the tar load generator through three runs without
> > > producing a spinning kswapd (this is PREEMPT).  I'll try with a
> > > non-PREEMPT kernel shortly.
> > > 
> > > What this seems to say is that there's a problem with the complex logic
> > > in sleeping_prematurely().  I'm pretty sure hacking up
> > > sleeping_prematurely() just to dump all the calculations is the wrong
> > > thing to do, but perhaps someone can see what the right thing is ...
> > 
> > I think I see the problem: the boolean logic of sleeping_prematurely()
> > is odd.  If it returns true, kswapd will keep running.  So if
> > pgdat_balanced() returns true, kswapd should go to sleep.
> > 
> > This?
> 
> I was going to say this was a winner, but on the third untar run on
> non-PREEMPT, I hit the kswapd livelock.  It's got much farther than
> previous attempts, which all hang on the first run, but I think the
> essential problem is still (at least on this machine) that
> sleeping_prematurely() is doing too much work for the wakeup storm that
> allocators are causing.
> 
> Something that ratelimits the amount of time we spend in the watermark
> calculations, like the below (which incorporates your pgdat fix) seems
> to be much more stable (I've not run it for three full runs yet, but
> kswapd CPU time is way lower so far).
> 
> The heuristic here is that if we're making the calculation more than ten
> times in 1/10 of a second, stop and sleep anyway.
> 

Is that heuristic not basically the same as this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index af24d1e..4d24828 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
 	unsigned long balanced = 0;
 	bool all_zones_ok = true;
 
+	/* If kswapd has been running too long, just sleep */
+	if (need_resched())
+		return false;
+
 	/* If a direct reclaimer woke kswapd within HZ/10, it's premature */
 	if (remaining)
 		return true;
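[Editor's note: the guard above can be sketched in userspace as below. The function and parameter names are hypothetical; `want_resched` stands in for the kernel's need_resched() flag. The point is the ordering: the yield check short-circuits before any watermark work is done.]

```c
#include <assert.h>

/* If another task wants the CPU, report "not premature" immediately
 * so kswapd goes to sleep; otherwise fall through to the usual
 * watermark verdict. */
static int sleeping_prematurely_sketch(int want_resched, int node_unbalanced)
{
	if (want_resched)
		return 0;		/* kswapd has run long enough: sleep */
	return node_unbalanced;		/* normal watermark-based verdict */
}
```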

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply related	[flat|nested] 77+ messages in thread

* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
  2011-05-12 18:04       ` Andrea Arcangeli
@ 2011-05-13 11:24         ` Mel Gorman
  0 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-13 11:24 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: James Bottomley, David Rientjes, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, May 12, 2011 at 08:04:57PM +0200, Andrea Arcangeli wrote:
> Hi James!
> 
> On Wed, May 11, 2011 at 05:34:27PM -0500, James Bottomley wrote:
> > Yes, but only once in all the testing.  With patches 1 and 2 the hang is
> 
> Weird patch 2 makes the large order allocation without ~__GFP_WAIT, so
> even COMPACTION=y/n shouldn't matter anymore. Am I misreading
> something Mel?
> 
> Removing ~__GFP_WAIT from patch 2 (and adding ~__GFP_REPEAT as a
> correctness improvement) and setting COMPACTION=y also should work ok.
> 


should_continue_reclaim() could still be looping unless __GFP_REPEAT is
cleared when CONFIG_COMPACTION is set.

> Removing ~__GFP_WAIT from patch 2 and setting COMPACTION=n is expected
> not to work well.
> 
> But compaction should only make the difference if you remove
> ~__GFP_WAIT from patch 2.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations
  2011-05-12 15:53           ` James Bottomley
@ 2011-05-13 11:25             ` Mel Gorman
  0 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-13 11:25 UTC (permalink / raw)
  To: James Bottomley
  Cc: Pekka Enberg, David Rientjes, Andrew Morton, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Rik van Riel, Johannes Weiner, linux-fsdevel, linux-mm,
	linux-kernel, linux-ext4

On Thu, May 12, 2011 at 10:53:44AM -0500, James Bottomley wrote:
> On Thu, 2011-05-12 at 09:04 -0500, James Bottomley wrote:
> > On Thu, 2011-05-12 at 14:13 +0300, Pekka Enberg wrote:
> > > On 5/12/11 1:34 AM, James Bottomley wrote:
> > > > On Wed, 2011-05-11 at 15:28 -0700, David Rientjes wrote:
> > > >> On Wed, 11 May 2011, James Bottomley wrote:
> > > >>
> > > >>> OK, I confirm that I can't seem to break this one.  No hangs visible,
> > > >>> even when loading up the system with firefox, evolution, the usual
> > > >>> massive untar, X and even a distribution upgrade.
> > > >>>
> > > >>> You can add my tested-by
> > > >>>
> > > >> Your system still hangs with patches 1 and 2 only?
> > > > Yes, but only once in all the testing.  With patches 1 and 2 the hang is
> > > > much harder to reproduce, but it still seems to be present if I hit it
> > > > hard enough.
> > > 
> > > Patches 1-2 look reasonable to me. I'm not completely convinced of patch 
> > > 3, though. Why are we seeing these problems now? This has been in 
> > > mainline for a long time already. Shouldn't we fix kswapd?
> > 
> > So I'm open to this.  The hang occurs when kswapd races around in
> > shrink_slab and never exits.  It looks like there's a massive number of
> > wakeups triggering this, but we haven't been able to diagnose it
> > further.  turning on PREEMPT gets rid of the hang, so I could try to
> > reproduce with PREEMPT and turn on tracing.  The problem so far has been
> > that the number of events is so huge that the trace buffer only captures
> > a few microseconds of output.
> 
> OK, here's the trace from a PREEMPT kernel (2.6.38.6) when kswapd hits
> 99% and stays there.  I've only enabled the vmscan tracepoints to try
> and get a longer run.  It mostly looks like kswapd waking itself, but
> there might be more in there that mm trained eyes can see.
> 

For 2.6.38.6, commit [2876592f: mm: vmscan: stop reclaim/compaction
earlier due to insufficient progress if !__GFP_REPEAT] may also be
needed if CONFIG_COMPACTION is set.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-13 10:55                         ` Mel Gorman
@ 2011-05-13 14:16                           ` James Bottomley
  0 siblings, 0 replies; 77+ messages in thread
From: James Bottomley @ 2011-05-13 14:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Johannes Weiner, Pekka Enberg, Christoph Lameter, Andrew Morton,
	Colin King, Raghavendra D Prabhu, Jan Kara, Chris Mason,
	Rik van Riel, linux-fsdevel, linux-mm, linux-kernel, linux-ext4

On Fri, 2011-05-13 at 11:55 +0100, Mel Gorman wrote:
> On Thu, May 12, 2011 at 07:47:05PM -0500, James Bottomley wrote:
> > On Fri, 2011-05-13 at 00:15 +0200, Johannes Weiner wrote:
> > > On Thu, May 12, 2011 at 05:04:41PM -0500, James Bottomley wrote:
> > > > On Thu, 2011-05-12 at 15:04 -0500, James Bottomley wrote:
> > > > > Confirmed, I'm afraid ... I can trigger the problem with all three
> > > > > patches under PREEMPT.  It's not a hang this time, it's just kswapd
> > > > > taking 100% system time on 1 CPU and it won't calm down after I unload
> > > > > the system.
> > > > 
> > > > Just on a "if you don't know what's wrong poke about and see" basis, I
> > > > sliced out all the complex logic in sleeping_prematurely() and, as far
> > > > as I can tell, it cures the problem behaviour.  I've loaded up the
> > > > system, and taken the tar load generator through three runs without
> > > > producing a spinning kswapd (this is PREEMPT).  I'll try with a
> > > > non-PREEMPT kernel shortly.
> > > > 
> > > > What this seems to say is that there's a problem with the complex logic
> > > > in sleeping_prematurely().  I'm pretty sure hacking up
> > > > sleeping_prematurely() just to dump all the calculations is the wrong
> > > > thing to do, but perhaps someone can see what the right thing is ...
> > > 
> > > I think I see the problem: the boolean logic of sleeping_prematurely()
> > > is odd.  If it returns true, kswapd will keep running.  So if
> > > pgdat_balanced() returns true, kswapd should go to sleep.
> > > 
> > > This?
> > 
> > I was going to say this was a winner, but on the third untar run on
> > non-PREEMPT, I hit the kswapd livelock.  It's got much farther than
> > previous attempts, which all hang on the first run, but I think the
> > essential problem is still (at least on this machine) that
> > sleeping_prematurely() is doing too much work for the wakeup storm that
> > allocators are causing.
> > 
> > Something that ratelimits the amount of time we spend in the watermark
> > calculations, like the below (which incorporates your pgdat fix) seems
> > to be much more stable (I've not run it for three full runs yet, but
> > kswapd CPU time is way lower so far).
> > 
> > The heuristic here is that if we're making the calculation more than ten
> > times in 1/10 of a second, stop and sleep anyway.
> > 
> 
> Is that heuristic not basically the same as this?
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index af24d1e..4d24828 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2251,6 +2251,10 @@ static bool sleeping_prematurely(pg_data_t *pgdat, int order, long remaining,
>  	unsigned long balanced = 0;
>  	bool all_zones_ok = true;
>  
> +	/* If kswapd has been running too long, just sleep */
> +	if (need_resched())
> +		return false;

Not exactly.  That should cure the problem (and I'll test it out).
However, the traces show most of the work is being caused by
sleeping_prematurely().  The object of my patch was actually to cut that
off.  Just doing a check on need_resched() will still allow us to run
around that loop for hundreds of milliseconds and contribute to needless
CPU time burn of kswapd; that's why I used a number of iterations and
time cutoff in my patch.  If we've run around the loop 10 times tightly
returning true (i.e. we can't sleep and need to rebalance) each time but
the shrinkers still haven't done enough, it's time to call it quits and
sleep anyway.

James



^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-13  9:49                           ` Mel Gorman
@ 2011-05-15 16:39                             ` Andrea Arcangeli
  2011-05-16  8:42                               ` Mel Gorman
  0 siblings, 1 reply; 77+ messages in thread
From: Andrea Arcangeli @ 2011-05-15 16:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Christoph Lameter, Pekka Enberg, James Bottomley, Dave Jones,
	Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Fri, May 13, 2011 at 10:49:58AM +0100, Mel Gorman wrote:
> On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote:
> > <SNIP>
> >
> > BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too
> > (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not
> > with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless
> > lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.
> 
> This is in V2 (unreleased, testing in progress and was running
> overnight). I noticed that clearing __GFP_REPEAT is required for
> reclaim/compaction if direct reclaimers from SLUB are to return false in
> should_continue_reclaim() and bail out from high-order allocation
> properly. As it is, there is a possibility for slub high-order direct
> reclaimers to loop in reclaim/compaction for a long time. This is
> only important when CONFIG_COMPACTION=y.

Agreed. However I don't expect anyone to allocate from slub(/slab)
with __GFP_REPEAT so it's probably only theoretical but more correct
indeed ;).

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-15 16:39                             ` Andrea Arcangeli
@ 2011-05-16  8:42                               ` Mel Gorman
  0 siblings, 0 replies; 77+ messages in thread
From: Mel Gorman @ 2011-05-16  8:42 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Christoph Lameter, Pekka Enberg, James Bottomley, Dave Jones,
	Andrew Morton, Colin King, Raghavendra D Prabhu, Jan Kara,
	Chris Mason, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Sun, May 15, 2011 at 06:39:06PM +0200, Andrea Arcangeli wrote:
> On Fri, May 13, 2011 at 10:49:58AM +0100, Mel Gorman wrote:
> > On Thu, May 12, 2011 at 08:00:18PM +0200, Andrea Arcangeli wrote:
> > > <SNIP>
> > >
> > > BTW, it comes to mind in patch 2, SLUB should clear __GFP_REPEAT too
> > > (not only __GFP_NOFAIL). Clearing __GFP_WAIT may be worth it or not
> > > with COMPACTION=y, definitely good idea to clear __GFP_WAIT unless
> > > lumpy is restricted to __GFP_REPEAT|__GFP_NOFAIL.
> > 
> > This is in V2 (unreleased, testing in progress and was running
> > overnight). I noticed that clearing __GFP_REPEAT is required for
> > reclaim/compaction if direct reclaimers from SLUB are to return false in
> > should_continue_reclaim() and bail out from high-order allocation
> > properly. As it is, there is a possibility for slub high-order direct
> > reclaimers to loop in reclaim/compaction for a long time. This is
> > only important when CONFIG_COMPACTION=y.
> 
> Agreed. However I don't expect anyone to allocate from slub(/slab)
> with __GFP_REPEAT so it's probably only theoretical but more correct
> indeed ;).

Networking layer does specify __GFP_REPEAT.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-12 17:36     ` Andrea Arcangeli
@ 2011-05-16 21:03       ` David Rientjes
  2011-05-17  9:48         ` Mel Gorman
  0 siblings, 1 reply; 77+ messages in thread
From: David Rientjes @ 2011-05-16 21:03 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Thu, 12 May 2011, Andrea Arcangeli wrote:

> On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> > kswapd and doing compaction for the higher order allocs before falling 
> 
> Note that patch 2 disabled compaction by clearing __GFP_WAIT.
> 
> What you describe here would be patch 2 without the ~__GFP_WAIT
> addition (so keeping only ~GFP_NOFAIL).
> 

It's out of context, my sentence was:

"With the previous changes in this patchset, specifically avoiding waking 
kswapd and doing compaction for the higher order allocs before falling 
back to the min order..."

meaning this patchset avoids waking kswapd and avoids doing compaction.

> Not clearing __GFP_WAIT when compaction is enabled is possible and
> shouldn't result in bad behavior (if compaction is not enabled with
> current SLUB it's hard to imagine how it could perform decently if
> there's fragmentation). You should try to benchmark to see if it's
> worth it on the large NUMA systems with heavy network traffic (for
> normal systems I doubt compaction is worth it but I'm not against
> trying to keep it enabled just in case).
> 

The fragmentation isn't the only issue with the netperf TCP_RR benchmark, 
the problem is that the slub slowpath is being used >95% of the time on 
every allocation and free for the very large number of kmalloc-256 and 
kmalloc-2K caches.  Those caches are order 1 and 3, respectively, on my 
system by default, but the page allocator seldom gets invoked for such a
benchmark after the partial lists are populated: the overhead is from the 
per-node locking required in the slowpath to traverse the partial lists.  
See the data I presented two years ago: http://lkml.org/lkml/2009/3/30/15.

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-16 21:03       ` David Rientjes
@ 2011-05-17  9:48         ` Mel Gorman
  2011-05-17 19:25           ` David Rientjes
  0 siblings, 1 reply; 77+ messages in thread
From: Mel Gorman @ 2011-05-17  9:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrea Arcangeli, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Mon, May 16, 2011 at 02:03:33PM -0700, David Rientjes wrote:
> On Thu, 12 May 2011, Andrea Arcangeli wrote:
> 
> > On Wed, May 11, 2011 at 01:38:47PM -0700, David Rientjes wrote:
> > > kswapd and doing compaction for the higher order allocs before falling 
> > 
> > Note that patch 2 disabled compaction by clearing __GFP_WAIT.
> > 
> > What you describe here would be patch 2 without the ~__GFP_WAIT
> > addition (so keeping only ~GFP_NOFAIL).
> > 
> 
> It's out of context, my sentence was:
> 
> "With the previous changes in this patchset, specifically avoiding waking 
> kswapd and doing compaction for the higher order allocs before falling 
> back to the min order..."
> 
> meaning this patchset avoids waking kswapd and avoids doing compaction.
> 

Ok.

> > Not clearing __GFP_WAIT when compaction is enabled is possible and
> > shouldn't result in bad behavior (if compaction is not enabled with
> > current SLUB it's hard to imagine how it could perform decently if
> > there's fragmentation). You should try to benchmark to see if it's
> > worth it on the large NUMA systems with heavy network traffic (for
> > normal systems I doubt compaction is worth it but I'm not against
> > trying to keep it enabled just in case).
> > 
> 
> The fragmentation isn't the only issue with the netperf TCP_RR benchmark, 
> the problem is that the slub slowpath is being used >95% of the time on 
> every allocation and free for the very large number of kmalloc-256 and 
> kmalloc-2K caches. 

Ok, that makes sense as I'd fully expect that benchmark to routinely
exhaust the per-cpu page (high order or otherwise) of slab objects
under the default settings, and I'd also expect the freeing on the
other side to be releasing slabs frequently to the partial or empty
lists.

> Those caches are order 1 and 3, respectively, on my 
> system by default, but the page allocator seldom gets invoked for such a 
> benchmark after the partial lists are populated: the overhead is from the 
> per-node locking required in the slowpath to traverse the partial lists.  
> See the data I presented two years ago: http://lkml.org/lkml/2009/3/30/15.

Ok, I can see how this patch would indeed make the situation worse. I
vaguely recall that there were other patches that would increase the
per-cpu lists of objects but have no recollection of what happened to
them.

Maybe Christoph remembers but one way or the other, it's out of scope
for James' and Colin's bug.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 77+ messages in thread

* Re: [PATCH 3/3] mm: slub: Default slub_max_order to 0
  2011-05-17  9:48         ` Mel Gorman
@ 2011-05-17 19:25           ` David Rientjes
  0 siblings, 0 replies; 77+ messages in thread
From: David Rientjes @ 2011-05-17 19:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Andrew Morton, James Bottomley, Colin King,
	Raghavendra D Prabhu, Jan Kara, Chris Mason, Christoph Lameter,
	Pekka Enberg, Rik van Riel, Johannes Weiner, linux-fsdevel,
	linux-mm, linux-kernel, linux-ext4

On Tue, 17 May 2011, Mel Gorman wrote:

> > The fragmentation isn't the only issue with the netperf TCP_RR benchmark, 
> > the problem is that the slub slowpath is being used >95% of the time on 
> > every allocation and free for the very large number of kmalloc-256 and 
> > kmalloc-2K caches. 
> 
> Ok, that makes sense as I'd fully expect that benchmark to routinely
> exhaust the per-cpu page (high order or otherwise) of slab objects
> under the default settings, and I'd also expect the freeing on the
> other side to be releasing slabs frequently to the partial or empty
> lists.
> 

That's most of the problem, but it's compounded on this benchmark because 
the slab pulled from the partial list to replace the per-cpu page 
typically only has a very minimal number (2 or 3) of free objects, so it 
can only serve one allocation and then require the allocation slowpath to 
pull yet another slab from the partial list the next time around.  I had a 
patchset that addressed that, which I called "slab thrashing", by only 
pulling a slab from the partial list when it had a pre-defined proportion 
of available objects and otherwise skipping it, and that ended up helping 
the benchmark by 5-7%.  Smaller orders will make this worse, as well, 
since if there were only 2 or 3 free objects on an order-3 slab before, 
there's no chance that's going to be equivalent on an order-0 slab.

^ permalink raw reply	[flat|nested] 77+ messages in thread

end of thread, other threads:[~2011-05-17 19:25 UTC | newest]

Thread overview: 77+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-05-11 15:29 [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations Mel Gorman
2011-05-11 15:29 ` [PATCH 1/3] mm: slub: Do not wake kswapd for SLUBs speculative " Mel Gorman
2011-05-11 20:38   ` David Rientjes
2011-05-11 15:29 ` [PATCH 2/3] mm: slub: Do not take expensive steps " Mel Gorman
2011-05-11 20:38   ` David Rientjes
2011-05-11 21:10     ` Mel Gorman
2011-05-12 17:25       ` Andrea Arcangeli
2011-05-11 15:29 ` [PATCH 3/3] mm: slub: Default slub_max_order to 0 Mel Gorman
2011-05-11 20:38   ` David Rientjes
2011-05-11 20:53     ` James Bottomley
2011-05-11 21:09     ` Mel Gorman
2011-05-11 22:27       ` David Rientjes
2011-05-13 10:14         ` Mel Gorman
2011-05-12 17:36     ` Andrea Arcangeli
2011-05-16 21:03       ` David Rientjes
2011-05-17  9:48         ` Mel Gorman
2011-05-17 19:25           ` David Rientjes
2011-05-12 14:43   ` Christoph Lameter
2011-05-12 15:15     ` James Bottomley
2011-05-12 15:27       ` Christoph Lameter
2011-05-12 15:43         ` James Bottomley
2011-05-12 15:46           ` Dave Jones
2011-05-12 16:00             ` James Bottomley
2011-05-12 16:08               ` Dave Jones
2011-05-12 16:27               ` Christoph Lameter
2011-05-12 16:30                 ` James Bottomley
2011-05-12 16:48                   ` Christoph Lameter
2011-05-12 17:46                     ` Andrea Arcangeli
2011-05-12 18:00                       ` Christoph Lameter
2011-05-12 18:18                         ` Andrea Arcangeli
2011-05-12 17:06                   ` Pekka Enberg
2011-05-12 17:11                     ` Pekka Enberg
2011-05-12 17:38                       ` Christoph Lameter
2011-05-12 18:00                         ` Andrea Arcangeli
2011-05-13  9:49                           ` Mel Gorman
2011-05-15 16:39                             ` Andrea Arcangeli
2011-05-16  8:42                               ` Mel Gorman
2011-05-12 17:51                       ` Andrea Arcangeli
2011-05-12 18:03                         ` Christoph Lameter
2011-05-12 18:09                           ` Andrea Arcangeli
2011-05-12 18:16                             ` Christoph Lameter
2011-05-12 18:36                       ` James Bottomley
2011-05-12 17:40                 ` Andrea Arcangeli
2011-05-12 15:55           ` Pekka Enberg
2011-05-12 18:37             ` James Bottomley
2011-05-12 18:46               ` Christoph Lameter
2011-05-12 19:21                 ` James Bottomley
2011-05-12 19:44               ` James Bottomley
2011-05-12 20:04                 ` James Bottomley
2011-05-12 20:29                   ` Johannes Weiner
2011-05-12 20:31                     ` Johannes Weiner
2011-05-12 20:31                     ` James Bottomley
2011-05-12 22:04                   ` James Bottomley
2011-05-12 22:15                     ` Johannes Weiner
2011-05-12 22:58                       ` Minchan Kim
2011-05-13  5:39                         ` Minchan Kim
2011-05-13  0:47                       ` James Bottomley
2011-05-13  4:12                         ` James Bottomley
2011-05-13 10:55                         ` Mel Gorman
2011-05-13 14:16                           ` James Bottomley
2011-05-13 10:30                       ` Mel Gorman
2011-05-13  6:16                   ` Pekka Enberg
2011-05-13 10:05                     ` Mel Gorman
2011-05-12 16:01           ` Christoph Lameter
2011-05-12 16:10             ` Eric Dumazet
2011-05-12 17:37               ` Andrew Morton
2011-05-12 15:45         ` Dave Jones
2011-05-11 21:39 ` [PATCH 0/3] Reduce impact to overall system of SLUB using high-order allocations James Bottomley
2011-05-11 22:28   ` David Rientjes
2011-05-11 22:34     ` James Bottomley
2011-05-12 11:13       ` Pekka Enberg
2011-05-12 13:19         ` Mel Gorman
2011-05-12 14:04         ` James Bottomley
2011-05-12 15:53           ` James Bottomley
2011-05-13 11:25             ` Mel Gorman
2011-05-12 18:04       ` Andrea Arcangeli
2011-05-13 11:24         ` Mel Gorman
