* [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Anton Blanchard @ 2014-01-07  2:21 UTC
  To: benh, paulus, cl, penberg, mpm, nacc; +Cc: linux-mm, linuxppc-dev


We noticed a huge amount of slab memory consumed on a large ppc64 box:

Slab:            2094336 kB

Almost 2GB. This box is not balanced and some nodes do not have local
memory, causing slub to be very inefficient in its slab usage.

Each time we call kmem_cache_alloc_node(), SLUB checks the per-cpu slab,
sees it isn't node local, deactivates it and tries to allocate a new
slab. On memoryless nodes we will allocate a new remote slab and use the
first slot, but as explained above, when we get called a second time we
will just deactivate that slab and retry.
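
For illustration, a simplified sketch of the pre-patch mismatch path in
__slab_alloc() (the unmodified lines appear in the diff below):

	if (unlikely(!node_match(page, node))) {
		stat(s, ALLOC_NODE_MISMATCH);
		/*
		 * The per-cpu slab is always thrown back and replaced,
		 * even when the requested node has no memory and the
		 * replacement must come from a remote node as well.
		 * Every kmem_cache_alloc_node() call repeats this, so
		 * each new slab ends up serving a single object.
		 */
		deactivate_slab(s, page, c->freelist);
		c->page = NULL;
		c->freelist = NULL;
		goto new_slab;
	}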

As such we end up only using 1 entry in each slab:

slab                    mem  objects
                       used   active
------------------------------------
kmalloc-16384       1404 MB    4.90%
task_struct          668 MB    2.90%
kmalloc-128          193 MB    3.61%
kmalloc-192          152 MB    5.23%
kmalloc-8192          72 MB   23.40%
kmalloc-16            64 MB    7.43%
kmalloc-512           33 MB   22.41%

The patch below checks that the node has memory before deactivating a
slab and trying to allocate a new one. With this patch applied we now
use about 352MB:

Slab:             360192 kB

And our efficiency is much better:

slab                    mem  objects
                       used   active
------------------------------------
kmalloc-16384         92 MB   74.27%
task_struct           23 MB   83.46%
idr_layer_cache       18 MB  100.00%
pgtable-2^12          17 MB  100.00%
kmalloc-65536         15 MB  100.00%
inode_cache           14 MB  100.00%
kmalloc-256           14 MB   97.81%
kmalloc-8192          14 MB   85.71%

Signed-off-by: Anton Blanchard <anton@samba.org>
---

Thoughts? It seems like we could hit a similar situation if a machine
is balanced but we run out of memory on a single node.

Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2278,10 +2278,17 @@ redo:
 
 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+
+		/*
+		 * If the node contains no memory there is no point in trying
+		 * to allocate a new node local slab
+		 */
+		if (node_spanned_pages(node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}
 
 	/*


* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Wanpeng Li @ 2014-01-07  4:19 UTC
  To: Anton Blanchard
  Cc: benh, paulus, cl, penberg, mpm, nacc, linux-mm, linuxppc-dev

On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>[...]
>Signed-off-by: Anton Blanchard <anton@samba.org>

Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>

>---
>
>Thoughts? It seems like we could hit a similar situation if a machine
>is balanced but we run out of memory on a single node.
>
>Index: b/mm/slub.c
>===================================================================
>--- a/mm/slub.c
>+++ b/mm/slub.c
>@@ -2278,10 +2278,17 @@ redo:
>
> 	if (unlikely(!node_match(page, node))) {
> 		stat(s, ALLOC_NODE_MISMATCH);
>-		deactivate_slab(s, page, c->freelist);
>-		c->page = NULL;
>-		c->freelist = NULL;
>-		goto new_slab;
>+
>+		/*
>+		 * If the node contains no memory there is no point in trying
>+		 * to allocate a new node local slab
>+		 */
>+		if (node_spanned_pages(node)) {

s/node_spanned_pages/node_present_pages 
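
For reference, node_spanned_pages() counts the node's whole physical
address range, including holes, so it can be non-zero even for a node
with no usable memory, while node_present_pages() counts pages that
actually exist. A sketch of the definitions (as in include/linux/mmzone.h):

	/* struct pglist_data, abridged */
	unsigned long node_present_pages; /* total number of physical pages */
	unsigned long node_spanned_pages; /* total size of physical page
					     range, including holes */

	#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
	#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)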

>+			deactivate_slab(s, page, c->freelist);
>+			c->page = NULL;
>+			c->freelist = NULL;
>+			goto new_slab;
>+		}
> 	}
>
> 	/*
>

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Andi Kleen @ 2014-01-07  6:49 UTC
  To: Anton Blanchard
  Cc: benh, paulus, cl, penberg, mpm, nacc, linux-mm, linuxppc-dev

Anton Blanchard <anton@samba.org> writes:
>
> Thoughts? It seems like we could hit a similar situation if a machine
> is balanced but we run out of memory on a single node.

Yes I agree, but your patch doesn't seem to attempt to handle this?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Joonsoo Kim @ 2014-01-07  7:41 UTC
  To: Anton Blanchard
  Cc: benh, paulus, cl, penberg, mpm, nacc, linux-mm, linuxppc-dev

On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
> [...]

Hello,

I think we need more effort to solve the unbalanced node problem.

With this patch, even if the node of the current cpu slab is not favorable
to the unbalanced node, allocation would proceed and we would get
unintended memory.

And there is one more problem. Even if we have some partial slabs on a
compatible node, we would allocate a new slab, because get_partial()
cannot handle this unbalanced node case.

To fix this correctly, how about the following patch?

Thanks.

------------->8--------------------
diff --git a/mm/slub.c b/mm/slub.c
index c3eb3d3..a1f6dfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 {
        void *object;
        int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+       struct zonelist *zonelist;
+       struct zoneref *z;
+       struct zone *zone;
+       enum zone_type high_zoneidx = gfp_zone(flags);
 
+       if (!node_present_pages(searchnode)) {
+               zonelist = node_zonelist(searchnode, flags);
+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+                       searchnode = zone_to_nid(zone);
+                       if (node_present_pages(searchnode))
+                               break;
+               }
+       }
        object = get_partial_node(s, get_node(s, searchnode), c, flags);
        if (object || node != NUMA_NO_NODE)
                return object;

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Wanpeng Li @ 2014-01-07  8:48 UTC
  To: Joonsoo Kim
  Cc: Anton Blanchard, benh, paulus, cl, penberg, mpm, nacc, linux-mm,
	linuxppc-dev

Hi Joonsoo,
On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
>On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>> 
[...]
>Hello,
>
>I think we need more effort to solve the unbalanced node problem.
>
>With this patch, even if the node of the current cpu slab is not favorable
>to the unbalanced node, allocation would proceed and we would get
>unintended memory.
>

We have a machine:

[    0.000000] Node 0 Memory:
[    0.000000] Node 4 Memory: 0x0-0x10000000 0x20000000-0x60000000 0x80000000-0xc0000000
[    0.000000] Node 6 Memory: 0x10000000-0x20000000 0x60000000-0x80000000
[    0.000000] Node 10 Memory: 0xc0000000-0x180000000

[    0.041486] Node 0 CPUs: 0-19
[    0.041490] Node 4 CPUs:
[    0.041492] Node 6 CPUs:
[    0.041495] Node 10 CPUs:

The pages of the current cpu slab should be allocated from the fallback
zones/nodes of the memoryless node in the buddy system, so how can the
unfavorable case happen?

>And there is one more problem. Even if we have some partial slabs on a
>compatible node, we would allocate a new slab, because get_partial()
>cannot handle this unbalanced node case.
>
>To fix this correctly, how about the following patch?
>

So I think we should fold your two patches into one.

Regards,
Wanpeng Li 

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Joonsoo Kim @ 2014-01-07  9:10 UTC
  To: Wanpeng Li
  Cc: Anton Blanchard, benh, paulus, cl, penberg, mpm, nacc, linux-mm,
	linuxppc-dev

On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:
> Hi Joonsoo,
> On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
> >On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
> >> 
> [...]
> >Hello,
> >
> >I think we need more effort to solve the unbalanced node problem.
> >
> >With this patch, even if the node of the current cpu slab is not favorable
> >to the unbalanced node, allocation would proceed and we would get
> >unintended memory.
> >
> 
> We have a machine:
> 
> [    0.000000] Node 0 Memory:
> [    0.000000] Node 4 Memory: 0x0-0x10000000 0x20000000-0x60000000 0x80000000-0xc0000000
> [    0.000000] Node 6 Memory: 0x10000000-0x20000000 0x60000000-0x80000000
> [    0.000000] Node 10 Memory: 0xc0000000-0x180000000
> 
> [    0.041486] Node 0 CPUs: 0-19
> [    0.041490] Node 4 CPUs:
> [    0.041492] Node 6 CPUs:
> [    0.041495] Node 10 CPUs:
> 
> The pages of the current cpu slab should be allocated from the fallback
> zones/nodes of the memoryless node in the buddy system, so how can the
> unfavorable case happen?

Hi, Wanpeng.

IIRC, if we call kmem_cache_alloc_node() with a certain node #, we try to
allocate the page from the fallback zones/nodes of that node #. So that
fallback list isn't related to the fallback list of the memoryless node #.
Am I wrong?
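
A minimal sketch of what I mean, assuming the slab page allocation path
of that era (alloc_slab_page() in mm/slub.c, simplified):

	/*
	 * For an explicit node, the page allocator walks the zonelist
	 * of *that* node: the fallback order is by distance from
	 * 'node', not from the (memoryless) node of the requesting cpu.
	 */
	if (node == NUMA_NO_NODE)
		page = alloc_pages(flags, order);
	else
		page = alloc_pages_exact_node(node, flags, order);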

Thanks.

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
From: Wanpeng Li @ 2014-01-07  9:21 UTC
  To: Joonsoo Kim
  Cc: Anton Blanchard, benh, paulus, cl, penberg, mpm, nacc, linux-mm,
	linuxppc-dev

On Tue, Jan 07, 2014 at 06:10:16PM +0900, Joonsoo Kim wrote:
>On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:
>> Hi Joonsoo,
>> [...]
>
>Hi, Wanpeng.
>
>IIRC, if we call kmem_cache_alloc_node() with a certain node #, we try to
>allocate the page from the fallback zones/nodes of that node #. So that
>fallback list isn't related to the fallback list of the memoryless node #.
>Am I wrong?
>

Anton adds a node_spanned_pages(node) check, so the current cpu slab
mentioned above is against a memoryless node. Am I missing something?

Regards,
Wanpeng Li 

>Thanks.
>
>> 
>> >And there is one more problem. Even if we have some partial slabs on
>> >compatible node, we would allocate new slab, because get_partial() cannot handle
>> >this unbalance node case.
>> >
>> >To fix this correctly, how about following patch?
>> >
>> 
>> So I think we should fold both of your two patches to one.
>> 
>> Regards,
>> Wanpeng Li 
>> 
>> >Thanks.
>> >
>> >------------->8--------------------
>> >diff --git a/mm/slub.c b/mm/slub.c
>> >index c3eb3d3..a1f6dfa 100644
>> >--- a/mm/slub.c
>> >+++ b/mm/slub.c
>> >@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>> > {
>> >        void *object;
>> >        int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
>> >+       struct zonelist *zonelist;
>> >+       struct zoneref *z;
>> >+       struct zone *zone;
>> >+       enum zone_type high_zoneidx = gfp_zone(flags);
>> >
>> >+       if (!node_present_pages(searchnode)) {
>> >+               zonelist = node_zonelist(searchnode, flags);
>> >+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>> >+                       searchnode = zone_to_nid(zone);
>> >+                       if (node_present_pages(searchnode))
>> >+                               break;
>> >+               }
>> >+       }
>> >        object = get_partial_node(s, get_node(s, searchnode), c, flags);
>> >        if (object || node != NUMA_NO_NODE)
>> >                return object;
>> >
>> >--
>> >To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> >the body to majordomo@kvack.org.  For more info on Linux MM,
>> >see: http://www.linux-mm.org/ .
>> >Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>> 
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  9:10       ` Joonsoo Kim
  (?)
@ 2014-01-07  9:21       ` Wanpeng Li
  -1 siblings, 0 replies; 229+ messages in thread
From: Wanpeng Li @ 2014-01-07  9:21 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: cl, nacc, penberg, linux-mm, paulus, Anton Blanchard, mpm, linuxppc-dev

On Tue, Jan 07, 2014 at 06:10:16PM +0900, Joonsoo Kim wrote:
>On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:
>> Hi Joonsoo,
>> On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
>> >On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>> >> 
>> [...]
>> >Hello,
>> >
>> >I think that we need more efforts to solve unbalanced node problem.
>> >
>> >With this patch, even if node of current cpu slab is not favorable to
>> >unbalanced node, allocation would proceed and we would get the unintended memory.
>> >
>> 
>> We have a machine:
>> 
>> [    0.000000] Node 0 Memory:
>> [    0.000000] Node 4 Memory: 0x0-0x10000000 0x20000000-0x60000000 0x80000000-0xc0000000
>> [    0.000000] Node 6 Memory: 0x10000000-0x20000000 0x60000000-0x80000000
>> [    0.000000] Node 10 Memory: 0xc0000000-0x180000000
>> 
>> [    0.041486] Node 0 CPUs: 0-19
>> [    0.041490] Node 4 CPUs:
>> [    0.041492] Node 6 CPUs:
>> [    0.041495] Node 10 CPUs:
>> 
>> The pages of current cpu slab should be allocated from fallback zones/nodes 
>> of the memoryless node in buddy system, how can not favorable happen? 
>
>Hi, Wanpeng.
>
>IIRC, if we call kmem_cache_alloc_node() with certain node #, we try to
>allocate the page in fallback zones/node of that node #. So fallback list isn't
>related to fallback one of memoryless node #. Am I wrong?
>

Anton add node_spanned_pages(node) check, so current cpu slab mentioned
above is against memoryless node. If I miss something?

Regards,
Wanpeng Li 

>Thanks.
>
>> 
>> >And there is one more problem. Even if we have some partial slabs on
>> >compatible node, we would allocate new slab, because get_partial() cannot handle
>> >this unbalance node case.
>> >
>> >To fix this correctly, how about following patch?
>> >
>> 
>> So I think we should fold both of your two patches to one.
>> 
>> Regards,
>> Wanpeng Li 
>> 
>> >Thanks.
>> >
>> >------------->8--------------------
>> >diff --git a/mm/slub.c b/mm/slub.c
>> >index c3eb3d3..a1f6dfa 100644
>> >--- a/mm/slub.c
>> >+++ b/mm/slub.c
>> >@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>> > {
>> >        void *object;
>> >        int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
>> >+       struct zonelist *zonelist;
>> >+       struct zoneref *z;
>> >+       struct zone *zone;
>> >+       enum zone_type high_zoneidx = gfp_zone(flags);
>> >
>> >+       if (!node_present_pages(searchnode)) {
>> >+               zonelist = node_zonelist(searchnode, flags);
>> >+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>> >+                       searchnode = zone_to_nid(zone);
>> >+                       if (node_present_pages(searchnode))
>> >+                               break;
>> >+               }
>> >+       }
>> >        object = get_partial_node(s, get_node(s, searchnode), c, flags);
>> >        if (object || node != NUMA_NO_NODE)
>> >                return object;
>> >

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  9:21       ` Wanpeng Li
@ 2014-01-07  9:31           ` Joonsoo Kim
  0 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-01-07  9:31 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Anton Blanchard, benh, paulus, cl, penberg, mpm, nacc, linux-mm,
	linuxppc-dev

On Tue, Jan 07, 2014 at 05:21:45PM +0800, Wanpeng Li wrote:
> On Tue, Jan 07, 2014 at 06:10:16PM +0900, Joonsoo Kim wrote:
> >On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:
> >> Hi Joonsoo,
> >> On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
> >> >On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
> >> >> 
> >> [...]
> >> >Hello,
> >> >
> >> >I think that we need more efforts to solve unbalanced node problem.
> >> >
> >> >With this patch, even if node of current cpu slab is not favorable to
> >> >unbalanced node, allocation would proceed and we would get the unintended memory.
> >> >
> >> 
> >> We have a machine:
> >> 
> >> [    0.000000] Node 0 Memory:
> >> [    0.000000] Node 4 Memory: 0x0-0x10000000 0x20000000-0x60000000 0x80000000-0xc0000000
> >> [    0.000000] Node 6 Memory: 0x10000000-0x20000000 0x60000000-0x80000000
> >> [    0.000000] Node 10 Memory: 0xc0000000-0x180000000
> >> 
> >> [    0.041486] Node 0 CPUs: 0-19
> >> [    0.041490] Node 4 CPUs:
> >> [    0.041492] Node 6 CPUs:
> >> [    0.041495] Node 10 CPUs:
> >> 
> >> The pages of current cpu slab should be allocated from fallback zones/nodes 
> >> of the memoryless node in buddy system, how can not favorable happen? 
> >
> >Hi, Wanpeng.
> >
> >IIRC, if we call kmem_cache_alloc_node() with certain node #, we try to
> >allocate the page in fallback zones/node of that node #. So fallback list isn't
> >related to fallback one of memoryless node #. Am I wrong?
> >
> 
> Anton added a node_spanned_pages(node) check, so the current cpu slab mentioned
> above is used against a memoryless node. Am I missing something?

I was thinking of the following scenario.

memoryless node # : 1
1's fallback node # : 0

On node 1's cpu,

1. kmem_cache_alloc_node (node 2)
2. allocate the page on node 2 for the slab; now the cpu slab is that one.
3. kmem_cache_alloc_node (local node, that is, node 1)
4. It checks node_spanned_pages() and finds that node 1 is a memoryless node,
so it returns node 2's memory.

Is this scenario impossible?
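
To make it concrete, a minimal sketch of the sequence (hypothetical cache
and node numbers, not code from the patch):

	struct kmem_cache *s;
	void *p1, *p2;

	s = kmem_cache_create("example", 64, 0, 0, NULL);

	/* running on a cpu that belongs to memoryless node 1 */
	p1 = kmem_cache_alloc_node(s, GFP_KERNEL, 2);
	/* a slab page is allocated on node 2 and becomes the cpu slab */

	p2 = kmem_cache_alloc_node(s, GFP_KERNEL, 1);
	/*
	 * node_match() fails (the cpu slab is on node 2), but since
	 * node_spanned_pages(1) == 0 the deactivate path is skipped,
	 * so the object comes from node 2 rather than from node 1's
	 * real fallback, node 0.
	 */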

Thanks.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* RE: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  2:21 ` Anton Blanchard
@ 2014-01-07  9:42   ` David Laight
  -1 siblings, 0 replies; 229+ messages in thread
From: David Laight @ 2014-01-07  9:42 UTC (permalink / raw)
  To: 'Anton Blanchard', benh, paulus, cl, penberg, mpm, nacc
  Cc: linux-mm, linuxppc-dev

> From: Anton Blanchard
> We noticed a huge amount of slab memory consumed on a large ppc64 box:
> 
> Slab:            2094336 kB
> 
> Almost 2GB. This box is not balanced and some nodes do not have local
> memory, causing slub to be very inefficient in its slab usage.
> 
> Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
> sees it isn't node local, deactivates it and tries to allocate a new
> slab. ...
...
>  	if (unlikely(!node_match(page, node))) {
>  		stat(s, ALLOC_NODE_MISMATCH);
> 		deactivate_slab(s, page, c->freelist);
> 		c->page = NULL;
> 		c->freelist = NULL;
> 		goto new_slab;
>  	}

Why not just delete the entire test?
Presumably some time a little earlier no local memory was available.
Even if some is available now, it is very likely that it will run out
again in the near future.

	David.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  9:31           ` Joonsoo Kim
                             ` (3 preceding siblings ...)
  (?)
@ 2014-01-07  9:49           ` Wanpeng Li
  -1 siblings, 0 replies; 229+ messages in thread
From: Wanpeng Li @ 2014-01-07  9:49 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Anton Blanchard, benh, paulus, cl, penberg, mpm, nacc, linux-mm,
	linuxppc-dev

On Tue, Jan 07, 2014 at 06:31:56PM +0900, Joonsoo Kim wrote:
>On Tue, Jan 07, 2014 at 05:21:45PM +0800, Wanpeng Li wrote:
>> On Tue, Jan 07, 2014 at 06:10:16PM +0900, Joonsoo Kim wrote:
>> >On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:
>> >> Hi Joonsoo,
>> >> On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
>> >> >On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>> >> >> 
>> >> [...]
>> >> >Hello,
>> >> >
>> >> >I think that we need more efforts to solve unbalanced node problem.
>> >> >
>> >> >With this patch, even if node of current cpu slab is not favorable to
>> >> >unbalanced node, allocation would proceed and we would get the unintended memory.
>> >> >
>> >> 
>> >> We have a machine:
>> >> 
>> >> [    0.000000] Node 0 Memory:
>> >> [    0.000000] Node 4 Memory: 0x0-0x10000000 0x20000000-0x60000000 0x80000000-0xc0000000
>> >> [    0.000000] Node 6 Memory: 0x10000000-0x20000000 0x60000000-0x80000000
>> >> [    0.000000] Node 10 Memory: 0xc0000000-0x180000000
>> >> 
>> >> [    0.041486] Node 0 CPUs: 0-19
>> >> [    0.041490] Node 4 CPUs:
>> >> [    0.041492] Node 6 CPUs:
>> >> [    0.041495] Node 10 CPUs:
>> >> 
>> >> The pages of current cpu slab should be allocated from fallback zones/nodes 
>> >> of the memoryless node in buddy system, how can not favorable happen? 
>> >
>> >Hi, Wanpeng.
>> >
>> >IIRC, if we call kmem_cache_alloc_node() with certain node #, we try to
>> >allocate the page in fallback zones/node of that node #. So fallback list isn't
>> >related to fallback one of memoryless node #. Am I wrong?
>> >
>> 
>> Anton added a node_spanned_pages(node) check, so the current cpu slab mentioned
>> above is used against a memoryless node. Am I missing something?
>
>I was thinking of the following scenario.
>
>memoryless node # : 1
>1's fallback node # : 0
>
>On node 1's cpu,
>
>1. kmem_cache_alloc_node (node 2)
>2. allocate the page on node 2 for the slab; now the cpu slab is that one.
>3. kmem_cache_alloc_node (local node, that is, node 1)
>4. It checks node_spanned_pages() and finds that node 1 is a memoryless node,
>so it returns node 2's memory.
>
>Is this scenario impossible?
>

Indeed, it can happen. 

Regards,
Wanpeng Li 

>Thanks.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  7:41   ` Joonsoo Kim
                     ` (6 preceding siblings ...)
  (?)
@ 2014-01-07  9:52   ` Wanpeng Li
  2014-01-09  0:20       ` Joonsoo Kim
  -1 siblings, 1 reply; 229+ messages in thread
From: Wanpeng Li @ 2014-01-07  9:52 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Anton Blanchard, benh, paulus, cl, penberg, mpm, nacc, linux-mm,
	linuxppc-dev

On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
>On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>> 
>> We noticed a huge amount of slab memory consumed on a large ppc64 box:
>> 
>> Slab:            2094336 kB
>> 
>> Almost 2GB. This box is not balanced and some nodes do not have local
>> memory, causing slub to be very inefficient in its slab usage.
>> 
>> Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
>> sees it isn't node local, deactivates it and tries to allocate a new
>> slab. On empty nodes we will allocate a new remote slab and use the
>> first slot, but as explained above when we get called a second time
>> we will just deactivate that slab and retry.
>> 
>> As such we end up only using 1 entry in each slab:
>> 
>> slab                    mem  objects
>>                        used   active
>> ------------------------------------
>> kmalloc-16384       1404 MB    4.90%
>> task_struct          668 MB    2.90%
>> kmalloc-128          193 MB    3.61%
>> kmalloc-192          152 MB    5.23%
>> kmalloc-8192          72 MB   23.40%
>> kmalloc-16            64 MB    7.43%
>> kmalloc-512           33 MB   22.41%
>> 
>> The patch below checks that a node is not empty before deactivating a
>> slab and trying to allocate it again. With this patch applied we now
>> use about 352MB:
>> 
>> Slab:             360192 kB
>> 
>> And our efficiency is much better:
>> 
>> slab                    mem  objects
>>                        used   active
>> ------------------------------------
>> kmalloc-16384         92 MB   74.27%
>> task_struct           23 MB   83.46%
>> idr_layer_cache       18 MB  100.00%
>> pgtable-2^12          17 MB  100.00%
>> kmalloc-65536         15 MB  100.00%
>> inode_cache           14 MB  100.00%
>> kmalloc-256           14 MB   97.81%
>> kmalloc-8192          14 MB   85.71%
>> 
>> Signed-off-by: Anton Blanchard <anton@samba.org>
>> ---
>> 
>> Thoughts? It seems like we could hit a similar situation if a machine
>> is balanced but we run out of memory on a single node.
>> 
>> Index: b/mm/slub.c
>> ===================================================================
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2278,10 +2278,17 @@ redo:
>>  
>>  	if (unlikely(!node_match(page, node))) {
>>  		stat(s, ALLOC_NODE_MISMATCH);
>> -		deactivate_slab(s, page, c->freelist);
>> -		c->page = NULL;
>> -		c->freelist = NULL;
>> -		goto new_slab;
>> +
>> +		/*
>> +		 * If the node contains no memory there is no point in trying
>> +		 * to allocate a new node local slab
>> +		 */
>> +		if (node_spanned_pages(node)) {
>> +			deactivate_slab(s, page, c->freelist);
>> +			c->page = NULL;
>> +			c->freelist = NULL;
>> +			goto new_slab;
>> +		}
>>  	}
>>  
>>  	/*
>
>Hello,
>
>I think that we need more efforts to solve unbalanced node problem.
>
>With this patch, even if node of current cpu slab is not favorable to
>unbalanced node, allocation would proceed and we would get the unintended memory.
>
>And there is one more problem. Even if we have some partial slabs on
>compatible node, we would allocate new slab, because get_partial() cannot handle
>this unbalance node case.
>
>To fix this correctly, how about following patch?
>
>Thanks.
>
>------------->8--------------------
>diff --git a/mm/slub.c b/mm/slub.c
>index c3eb3d3..a1f6dfa 100644
>--- a/mm/slub.c
>+++ b/mm/slub.c
>@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> {
>        void *object;
>        int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
>+       struct zonelist *zonelist;
>+       struct zoneref *z;
>+       struct zone *zone;
>+       enum zone_type high_zoneidx = gfp_zone(flags);
>
>+       if (!node_present_pages(searchnode)) {
>+               zonelist = node_zonelist(searchnode, flags);
>+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>+                       searchnode = zone_to_nid(zone);
>+                       if (node_present_pages(searchnode))
>+                               break;
>+               }
>+       }

Why change searchnode instead of relying on the fallback zones/nodes walk in
get_any_partial() to allocate partial slabs?
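
For context, the tail of get_partial() (abbreviated from 3.13-era
mm/slub.c; quoted from memory, so details may differ):

	static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
			struct kmem_cache_cpu *c)
	{
		void *object;
		int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

		object = get_partial_node(s, get_node(s, searchnode), c, flags);
		if (object || node != NUMA_NO_NODE)
			return object;

		/* only reached when the caller passed NUMA_NO_NODE */
		return get_any_partial(s, flags, c);
	}

If I read it right, get_any_partial() only walks the fallback zonelist for
NUMA_NO_NODE requests, and it is also gated by remote_node_defrag_ratio.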

Regards,
Wanpeng Li 

>        object = get_partial_node(s, get_node(s, searchnode), c, flags);
>        if (object || node != NUMA_NO_NODE)
>                return object;
>

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  2:21 ` Anton Blanchard
                   ` (7 preceding siblings ...)
  (?)
@ 2014-01-07 10:28 ` Wanpeng Li
  -1 siblings, 0 replies; 229+ messages in thread
From: Wanpeng Li @ 2014-01-07 10:28 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: benh, paulus, cl, penberg, mpm, nacc, linux-mm, linuxppc-dev,
	Joonsoo Kim, Andi Kleen

On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>
>We noticed a huge amount of slab memory consumed on a large ppc64 box:
>
>Slab:            2094336 kB
>
>Almost 2GB. This box is not balanced and some nodes do not have local
>memory, causing slub to be very inefficient in its slab usage.
>
>Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
>sees it isn't node local, deactivates it and tries to allocate a new
>slab. On empty nodes we will allocate a new remote slab and use the
>first slot, but as explained above when we get called a second time
>we will just deactivate that slab and retry.
>

Deactivating the cpu slab doesn't always mean freeing it back to the buddy
system; the slab may instead be put back on the remote node's partial list if
some of its objects are still in use in this unbalanced situation. In that
case, the slub slow path can freeze the partial slab on the remote node again.
So why does the slab cache end up as fragmented as shown below?
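
My understanding of the deactivation endgame, as a rough sketch (paraphrased
from memory, details elided; n is the kmem_cache_node of the page's node):

	if (slab_is_completely_free && n->nr_partial > s->min_partial)
		discard_slab(s, page);		/* back to the buddy allocator */
	else
		add_partial(n, page, tail);	/* parked on the page's node */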

Regards,
Wanpeng Li 

>As such we end up only using 1 entry in each slab:
>
>slab                    mem  objects
>                       used   active
>------------------------------------
>kmalloc-16384       1404 MB    4.90%
>task_struct          668 MB    2.90%
>kmalloc-128          193 MB    3.61%
>kmalloc-192          152 MB    5.23%
>kmalloc-8192          72 MB   23.40%
>kmalloc-16            64 MB    7.43%
>kmalloc-512           33 MB   22.41%
>
>The patch below checks that a node is not empty before deactivating a
>slab and trying to allocate it again. With this patch applied we now
>use about 352MB:
>
>Slab:             360192 kB
>
>And our efficiency is much better:
>
>slab                    mem  objects
>                       used   active
>------------------------------------
>kmalloc-16384         92 MB   74.27%
>task_struct           23 MB   83.46%
>idr_layer_cache       18 MB  100.00%
>pgtable-2^12          17 MB  100.00%
>kmalloc-65536         15 MB  100.00%
>inode_cache           14 MB  100.00%
>kmalloc-256           14 MB   97.81%
>kmalloc-8192          14 MB   85.71%
>
>Signed-off-by: Anton Blanchard <anton@samba.org>
>---
>
>Thoughts? It seems like we could hit a similar situation if a machine
>is balanced but we run out of memory on a single node.
>
>Index: b/mm/slub.c
>===================================================================
>--- a/mm/slub.c
>+++ b/mm/slub.c
>@@ -2278,10 +2278,17 @@ redo:
>
> 	if (unlikely(!node_match(page, node))) {
> 		stat(s, ALLOC_NODE_MISMATCH);
>-		deactivate_slab(s, page, c->freelist);
>-		c->page = NULL;
>-		c->freelist = NULL;
>-		goto new_slab;
>+
>+		/*
>+		 * If the node contains no memory there is no point in trying
>+		 * to allocate a new node local slab
>+		 */
>+		if (node_spanned_pages(node)) {
>+			deactivate_slab(s, page, c->freelist);
>+			c->page = NULL;
>+			c->freelist = NULL;
>+			goto new_slab;
>+		}
> 	}
>
> 	/*
>

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  2:21 ` Anton Blanchard
                   ` (9 preceding siblings ...)
  (?)
@ 2014-01-07 10:28 ` Wanpeng Li
  -1 siblings, 0 replies; 229+ messages in thread
From: Wanpeng Li @ 2014-01-07 10:28 UTC (permalink / raw)
  To: Anton Blanchard
  Cc: cl, nacc, penberg, linux-mm, Andi Kleen, paulus, mpm,
	Joonsoo Kim, linuxppc-dev

On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
>
>We noticed a huge amount of slab memory consumed on a large ppc64 box:
>
>Slab:            2094336 kB
>
>Almost 2GB. This box is not balanced and some nodes do not have local
>memory, causing slub to be very inefficient in its slab usage.
>
>Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
>sees it isn't node local, deactivates it and tries to allocate a new
>slab. On empty nodes we will allocate a new remote slab and use the
>first slot, but as explained above when we get called a second time
>we will just deactivate that slab and retry.
>

Deactive cpu slab cache doesn't always mean free the slab cache to buddy system, 
maybe the slab cache will be putback to the remote node's partial list if there 
are objects still in used in this unbalance situation. In this case, the slub slow 
path can freeze the partial slab in remote node again. So why the slab cache is 
fragmented as below? 

Regards,
Wanpeng Li 

>As such we end up only using 1 entry in each slab:
>
>slab                    mem  objects
>                       used   active
>------------------------------------
>kmalloc-16384       1404 MB    4.90%
>task_struct          668 MB    2.90%
>kmalloc-128          193 MB    3.61%
>kmalloc-192          152 MB    5.23%
>kmalloc-8192          72 MB   23.40%
>kmalloc-16            64 MB    7.43%
>kmalloc-512           33 MB   22.41%
>
>The patch below checks that a node is not empty before deactivating a
>slab and trying to allocate it again. With this patch applied we now
>use about 352MB:
>
>Slab:             360192 kB
>
>And our efficiency is much better:
>
>slab                    mem  objects
>                       used   active
>------------------------------------
>kmalloc-16384         92 MB   74.27%
>task_struct           23 MB   83.46%
>idr_layer_cache       18 MB  100.00%
>pgtable-2^12          17 MB  100.00%
>kmalloc-65536         15 MB  100.00%
>inode_cache           14 MB  100.00%
>kmalloc-256           14 MB   97.81%
>kmalloc-8192          14 MB   85.71%
>
>Signed-off-by: Anton Blanchard <anton@samba.org>
>---
>
>Thoughts? It seems like we could hit a similar situation if a machine
>is balanced but we run out of memory on a single node.
>
>Index: b/mm/slub.c
>===================================================================
>--- a/mm/slub.c
>+++ b/mm/slub.c
>@@ -2278,10 +2278,17 @@ redo:
>
> 	if (unlikely(!node_match(page, node))) {
> 		stat(s, ALLOC_NODE_MISMATCH);
>-		deactivate_slab(s, page, c->freelist);
>-		c->page = NULL;
>-		c->freelist = NULL;
>-		goto new_slab;
>+
>+		/*
>+		 * If the node contains no memory there is no point in trying
>+		 * to allocate a new node local slab
>+		 */
>+		if (node_spanned_pages(node)) {
>+			deactivate_slab(s, page, c->freelist);
>+			c->page = NULL;
>+			c->freelist = NULL;
>+			goto new_slab;
>+		}
> 	}
>
> 	/*
>
>--
>To unsubscribe, send a message with 'unsubscribe linux-mm' in
>the body to majordomo@kvack.org.  For more info on Linux MM,
>see: http://www.linux-mm.org/ .
>Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  6:49   ` Andi Kleen
@ 2014-01-08 14:03     ` Anton Blanchard
  -1 siblings, 0 replies; 229+ messages in thread
From: Anton Blanchard @ 2014-01-08 14:03 UTC (permalink / raw)
  To: Andi Kleen; +Cc: benh, paulus, cl, penberg, mpm, nacc, linux-mm, linuxppc-dev


Hi Andi,

> > Thoughts? It seems like we could hit a similar situation if a
> > machine is balanced but we run out of memory on a single node.
> 
> Yes I agree, but your patch doesn't seem to attempt to handle this?

It doesn't. I was hoping someone with more mm knowledge than I could
suggest a lightweight way of doing this.

Anton

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  9:42   ` David Laight
@ 2014-01-08 14:14     ` Anton Blanchard
  -1 siblings, 0 replies; 229+ messages in thread
From: Anton Blanchard @ 2014-01-08 14:14 UTC (permalink / raw)
  To: David Laight; +Cc: benh, paulus, cl, penberg, mpm, nacc, linux-mm, linuxppc-dev


Hi David,

> Why not just delete the entire test?
> Presumably some time a little earlier no local memory was available.
> Even if there is some available now, it is very likely that some won't
> be available again in the near future.

I agree, the current behaviour seems strange but it has been around
since the initial slub commit.

Anton

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  4:19 ` Wanpeng Li
@ 2014-01-08 14:17     ` Anton Blanchard
  0 siblings, 0 replies; 229+ messages in thread
From: Anton Blanchard @ 2014-01-08 14:17 UTC (permalink / raw)
  To: Wanpeng Li; +Cc: benh, paulus, cl, penberg, mpm, nacc, linux-mm, linuxppc-dev


Hi Wanpeng,

> >+		if (node_spanned_pages(node)) {
> 
> s/node_spanned_pages/node_present_pages 

Thanks, I hadn't come across node_present_pages() before.

Anton
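
(For context: both accessors live in include/linux/mmzone.h.
node_spanned_pages() covers the whole physical range a node spans, holes
included, while node_present_pages() counts the pages actually present, so
only the latter is guaranteed to be zero on a memoryless node. A sketch of
the corrected hunk, untested:)

	if (node_present_pages(node)) {
		deactivate_slab(s, page, c->freelist);
		c->page = NULL;
		c->freelist = NULL;
		goto new_slab;
	}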

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  9:52   ` Wanpeng Li
@ 2014-01-09  0:20       ` Joonsoo Kim
  0 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-01-09  0:20 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Anton Blanchard, benh, paulus, cl, penberg, mpm, nacc, linux-mm,
	linuxppc-dev

On Tue, Jan 07, 2014 at 05:52:31PM +0800, Wanpeng Li wrote:
> On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
> >On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:

> >> Index: b/mm/slub.c
> >> ===================================================================
> >> --- a/mm/slub.c
> >> +++ b/mm/slub.c
> >> @@ -2278,10 +2278,17 @@ redo:
> >>  
> >>  	if (unlikely(!node_match(page, node))) {
> >>  		stat(s, ALLOC_NODE_MISMATCH);
> >> -		deactivate_slab(s, page, c->freelist);
> >> -		c->page = NULL;
> >> -		c->freelist = NULL;
> >> -		goto new_slab;
> >> +
> >> +		/*
> >> +		 * If the node contains no memory there is no point in trying
> >> +		 * to allocate a new node local slab
> >> +		 */
> >> +		if (node_spanned_pages(node)) {
> >> +			deactivate_slab(s, page, c->freelist);
> >> +			c->page = NULL;
> >> +			c->freelist = NULL;
> >> +			goto new_slab;
> >> +		}
> >>  	}
> >>  
> >>  	/*
> >
> >Hello,
> >
> >I think that we need more efforts to solve unbalanced node problem.
> >
> >With this patch, even if the node of the current cpu slab is not suitable
> >for the unbalanced node, allocation would proceed and we would get
> >unintended memory.
> >
> >And there is one more problem. Even if we have some partial slabs on a
> >compatible node, we would allocate a new slab, because get_partial() cannot
> >handle this unbalanced node case.
> >
> >To fix this correctly, how about following patch?
> >
> >Thanks.
> >
> >------------->8--------------------
> >diff --git a/mm/slub.c b/mm/slub.c
> >index c3eb3d3..a1f6dfa 100644
> >--- a/mm/slub.c
> >+++ b/mm/slub.c
> >@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> > {
> >        void *object;
> >        int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> >+       struct zonelist *zonelist;
> >+       struct zoneref *z;
> >+       struct zone *zone;
> >+       enum zone_type high_zoneidx = gfp_zone(flags);
> >
> >+       if (!node_present_pages(searchnode)) {
> >+               zonelist = node_zonelist(searchnode, flags);
> >+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> >+                       searchnode = zone_to_nid(zone);
> >+                       if (node_present_pages(searchnode))
> >+                               break;
> >+               }
> >+       }
> 
> Why change searchnode instead of depending on fallback zones/nodes in 
> get_any_partial() to allocate partial slabs?
> 

If node != NUMA_NO_NODE, get_any_partial() isn't called.
That's why I changed searchnode here instead of in get_any_partial().

Thanks.
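
(For context, the control flow being discussed looks roughly like this; a
condensed sketch of get_partial() in mm/slub.c of this era:)

static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
		struct kmem_cache_cpu *c)
{
	void *object;
	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

	object = get_partial_node(s, get_node(s, searchnode), c, flags);
	/*
	 * If a specific node was requested we return here, success or
	 * failure, so get_any_partial() is never reached in that case.
	 */
	if (object || node != NUMA_NO_NODE)
		return object;

	return get_any_partial(s, flags, c);
}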

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-07  7:41   ` Joonsoo Kim
                     ` (9 preceding siblings ...)
  (?)
@ 2014-01-20  9:10   ` Wanpeng Li
  -1 siblings, 0 replies; 229+ messages in thread
From: Wanpeng Li @ 2014-01-20  9:10 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: benh, paulus, cl, penberg, mpm, nacc, Anton Blanchard, linux-mm,
	linuxppc-dev, Han Pingtian

[-- Attachment #1: Type: text/plain, Size: 1203 bytes --]

Hi Joonsoo,
On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
[...]
>
>------------->8--------------------
>diff --git a/mm/slub.c b/mm/slub.c
>index c3eb3d3..a1f6dfa 100644
>--- a/mm/slub.c
>+++ b/mm/slub.c
>@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> {
>        void *object;
>        int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
>+       struct zonelist *zonelist;
>+       struct zoneref *z;
>+       struct zone *zone;
>+       enum zone_type high_zoneidx = gfp_zone(flags);
>
>+       if (!node_present_pages(searchnode)) {
>+               zonelist = node_zonelist(searchnode, flags);
>+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>+                       searchnode = zone_to_nid(zone);
>+                       if (node_present_pages(searchnode))
>+                               break;
>+               }
>+       }
>        object = get_partial_node(s, get_node(s, searchnode), c, flags);
>        if (object || node != NUMA_NO_NODE)
>                return object;
>

The patch fixes the bug. However, the kernel crashed very quickly after running
stress tests for a short while:


[-- Attachment #2: oops --]
[-- Type: text/plain, Size: 4918 bytes --]

[  287.464285] Unable to handle kernel paging request for data at address 0x00000001
[  287.464289] Faulting instruction address: 0xc000000000445af8
[  287.464294] Oops: Kernel access of bad area, sig: 11 [#1]
[  287.464296] SMP NR_CPUS=2048 NUMA pSeries
[  287.464301] Modules linked in: btrfs raid6_pq xor dm_service_time sg nfsv3 arc4 md4 rpcsec_gss_krb5 nfsv4 nls_utf8 cifs nfs fscache dns_resolver nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ip_tables ext4 mbcache jbd2 ibmvfc scsi_transport_fc ibmveth nx_crypto pseries_rng nfsd auth_rpcgss nfs_acl lockd binfmt_misc sunrpc uinput dm_multipath xfs libcrc32c sd_mod crc_t10dif crct10dif_common ibmvscsi scsi_transport_srp scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[  287.464374] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-71.el7.91831.ppc64 #1
[  287.464378] task: c000000000fde590 ti: c0000001fffd0000 task.ti: c0000000010a4000
[  287.464382] NIP: c000000000445af8 LR: c000000000445bcc CTR: c000000000445b90
[  287.464385] REGS: c0000001fffd38e0 TRAP: 0300   Not tainted  (3.10.0-71.el7.91831.ppc64)
[  287.464388] MSR: 8000000000009032 <SF,EE,ME,IR,DR,RI>  CR: 88002084  XER: 00000001
[  287.464397] SOFTE: 0
[  287.464398] CFAR: c00000000000908c
[  287.464401] DAR: 0000000000000001, DSISR: 40000000
[  287.464403]
GPR00: d000000003649a04 c0000001fffd3b60 c0000000010a94d0 0000000000000003
GPR04: c00000018d841048 c0000001fffd3bd0 0000000000000012 d00000000364eff0
GPR08: c0000001fffd3bd0 0000000000000001 d00000000364d688 c000000000445b90
GPR12: d00000000364b960 c000000007e00000 00000000042ac510 0000000000000060
GPR16: 0000000000200000 00000000fffffb19 c000000001122100 0000000000000000
GPR20: c000000000a94680 c000000001122180 c000000000a94680 000000000000000a
GPR24: 0000000000000100 0000000000000000 0000000000000001 c0000001ef900000
GPR28: c0000001d6c066f0 c0000001aea03520 c0000001bc9a2640 c00000018d841680
[  287.464447] NIP [c000000000445af8] .__dev_printk+0x28/0xc0
[  287.464450] LR [c000000000445bcc] .dev_printk+0x3c/0x50
[  287.464453] PACATMSCRATCH [8000000000009032]
[  287.464455] Call Trace:
[  287.464458] [c0000001fffd3b60] [c0000001fffd3c00] 0xc0000001fffd3c00 (unreliable)
[  287.464467] [c0000001fffd3bf0] [d000000003649a04] .ibmvfc_scsi_done+0x334/0x3e0 [ibmvfc]
[  287.464474] [c0000001fffd3cb0] [d0000000036495b8] .ibmvfc_handle_crq+0x2e8/0x320 [ibmvfc]
[  287.464488] [c0000001fffd3d30] [d000000003649fe4] .ibmvfc_tasklet+0xd4/0x250 [ibmvfc]
[  287.464494] [c0000001fffd3de0] [c00000000009b46c] .tasklet_action+0xcc/0x1b0
[  287.464498] [c0000001fffd3e90] [c00000000009a668] .__do_softirq+0x148/0x360
[  287.464503] [c0000001fffd3f90] [c0000000000218a8] .call_do_softirq+0x14/0x24
[  287.464507] [c0000001fffcfdf0] [c0000000000107e0] .do_softirq+0xd0/0x100
[  287.464511] [c0000001fffcfe80] [c00000000009aba8] .irq_exit+0x1b8/0x1d0
[  287.464514] [c0000001fffcff10] [c000000000010410] .__do_irq+0xc0/0x1e0
[  287.464518] [c0000001fffcff90] [c0000000000218cc] .call_do_irq+0x14/0x24
[  287.464522] [c0000000010a76d0] [c0000000000105bc] .do_IRQ+0x8c/0x100
[  287.464527] --- Exception: 501 at 0xffff
[  287.464527]     LR = .arch_local_irq_restore+0x74/0x90
[  287.464533] [c0000000010a7770] [c000000000002494] hardware_interrupt_common+0x114/0x180 (unreliable)
[  287.464540] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
[  287.464540]     LR = .check_and_cede_processor+0x24/0x40
[  287.464546] [c0000000010a7a60] [0000000000000001] 0x1 (unreliable)
[  287.464550] [c0000000010a7ad0] [c000000000074ecc] .shared_cede_loop+0x2c/0x70
[  287.464555] [c0000000010a7b50] [c0000000005538f4] .cpuidle_enter_state+0x64/0x150
[  287.464559] [c0000000010a7c10] [c000000000553ad0] .cpuidle_idle_call+0xf0/0x300
[  287.464563] [c0000000010a7cc0] [c0000000000695c0] .pseries_lpar_idle+0x10/0x50
[  287.464568] [c0000000010a7d30] [c000000000016ee4] .arch_cpu_idle+0x64/0x150
[  287.464572] [c0000000010a7db0] [c0000000000f6504] .cpu_startup_entry+0x1a4/0x2d0
[  287.464577] [c0000000010a7e80] [c00000000000bd04] .rest_init+0x94/0xb0
[  287.464582] [c0000000010a7ef0] [c000000000a044d0] .start_kernel+0x4b0/0x4cc
[  287.464586] [c0000000010a7f90] [c000000000009d30] .start_here_common+0x20/0x70
[  287.464589] Instruction dump:
[  287.464591] 60000000 60420000 2c240000 7c6a1b78 41c20088 e9240090 88630001 7ca82b78
[  287.464598] 2fa90000 3863ffd0 7c6307b4 419e002c <e8c90000> e8e40050 2fa70000 419e004c
[  287.464606] ---[ end trace c469801a8c53d8f1 ]---
[  287.466576]
[  287.466582] Sending IPI to other CPUs
[  287.468526] IPI complete


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
       [not found]   ` <52dce7fe.e5e6420a.5ff6.ffff84a0SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-01-20 22:13       ` Christoph Lameter
  0 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-01-20 22:13 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Joonsoo Kim, benh, paulus, penberg, mpm, nacc, Anton Blanchard,
	linux-mm, linuxppc-dev, Han Pingtian

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1025 bytes --]

On Mon, 20 Jan 2014, Wanpeng Li wrote:

> >+       enum zone_type high_zoneidx = gfp_zone(flags);
> >
> >+       if (!node_present_pages(searchnode)) {
> >+               zonelist = node_zonelist(searchnode, flags);
> >+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> >+                       searchnode = zone_to_nid(zone);
> >+                       if (node_present_pages(searchnode))
> >+                               break;
> >+               }
> >+       }
> >        object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >        if (object || node != NUMA_NO_NODE)
> >                return object;
> >
>
> The patch fixes the bug. However, the kernel crashed very quickly after running
> stress tests for a short while:

This is not a good way of fixing it. How about not asking for memory from
nodes that are memoryless? Use numa_mem_id(), which gives you the nearest
node that has memory, instead of numa_node_id(), which gives you the current
node regardless of whether it has memory.
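
(A rough, untested sketch of that suggestion, assuming
CONFIG_HAVE_MEMORYLESS_NODES is set so that numa_mem_id() resolves to the
nearest node with memory:)

static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
		struct kmem_cache_cpu *c)
{
	void *object;
	/*
	 * numa_mem_id() never names a memoryless node, so the zonelist
	 * walk in the patch above becomes unnecessary.
	 */
	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

	object = get_partial_node(s, get_node(s, searchnode), c, flags);
	if (object || node != NUMA_NO_NODE)
		return object;

	return get_any_partial(s, flags, c);
}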


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-20 22:13       ` Christoph Lameter
  (?)
  (?)
@ 2014-01-21  2:20       ` Wanpeng Li
  -1 siblings, 0 replies; 229+ messages in thread
From: Wanpeng Li @ 2014-01-21  2:20 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, benh, paulus, penberg, mpm, nacc, Anton Blanchard,
	linux-mm, linuxppc-dev, Han Pingtian

On Mon, Jan 20, 2014 at 04:13:30PM -0600, Christoph Lameter wrote:
>On Mon, 20 Jan 2014, Wanpeng Li wrote:
>
>> >+       enum zone_type high_zoneidx = gfp_zone(flags);
>> >
>> >+       if (!node_present_pages(searchnode)) {
>> >+               zonelist = node_zonelist(searchnode, flags);
>> >+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>> >+                       searchnode = zone_to_nid(zone);
>> >+                       if (node_present_pages(searchnode))
>> >+                               break;
>> >+               }
>> >+       }
>> >        object = get_partial_node(s, get_node(s, searchnode), c, flags);
>> >        if (object || node != NUMA_NO_NODE)
>> >                return object;
>> >
>>
>> The patch fixes the bug. However, the kernel crashed very quickly after running
>> stress tests for a short while:
>
>This is not a good way of fixing it. How about not asking for memory from
>nodes that are memoryless? Use numa_mem_id(), which gives you the next node
>that has memory, instead of numa_node_id() (which gives you the current
>node regardless of whether it has memory).

Thanks for pointing that out; I will do it and retest later.
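
To make sure I understand, here is a minimal sketch of the distinction
(a hypothetical call site, not code from this thread; s is assumed to
be a struct kmem_cache pointer):

	/* numa_node_id(): node of the executing CPU, may be memoryless */
	int cpu_node = numa_node_id();

	/* numa_mem_id(): nearest node that actually has memory */
	int mem_node = numa_mem_id();

	/* ask for memory from a node that can actually satisfy it */
	void *object = kmem_cache_alloc_node(s, GFP_KERNEL, mem_node);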

Regards,
Wanpeng Li 


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-20 22:13       ` Christoph Lameter
                         ` (7 preceding siblings ...)
  (?)
@ 2014-01-24  3:09       ` Wanpeng Li
  2014-01-24  3:14         ` Wanpeng Li
                           ` (4 more replies)
  -1 siblings, 5 replies; 229+ messages in thread
From: Wanpeng Li @ 2014-01-24  3:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, benh, paulus, penberg, mpm, nacc, Anton Blanchard,
	linux-mm, linuxppc-dev, Han Pingtian, David Rientjes

Hi Christoph,
On Mon, Jan 20, 2014 at 04:13:30PM -0600, Christoph Lameter wrote:
>On Mon, 20 Jan 2014, Wanpeng Li wrote:
>
>> >+       enum zone_type high_zoneidx = gfp_zone(flags);
>> >
>> >+       if (!node_present_pages(searchnode)) {
>> >+               zonelist = node_zonelist(searchnode, flags);
>> >+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>> >+                       searchnode = zone_to_nid(zone);
>> >+                       if (node_present_pages(searchnode))
>> >+                               break;
>> >+               }
>> >+       }
>> >        object = get_partial_node(s, get_node(s, searchnode), c, flags);
>> >        if (object || node != NUMA_NO_NODE)
>> >                return object;
>> >
>>
>> The patch fixes the bug. However, the kernel crashed very quickly after running
>> stress tests for a short while:
>
>This is not a good way of fixing it. How about not asking for memory from
>nodes that are memoryless? Use numa_mem_id(), which gives you the next node
>that has memory, instead of numa_node_id() (which gives you the current
>node regardless of whether it has memory).

diff --git a/mm/slub.c b/mm/slub.c
index 545a170..a1c6040 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 	void *object;
	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

+	if (!node_present_pages(searchnode))
+		searchnode = numa_mem_id();
+
	object = get_partial_node(s, get_node(s, searchnode), c, flags);
	if (object || node != NUMA_NO_NODE)
		return object;


^ permalink raw reply related	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-24  3:09       ` Wanpeng Li
                           ` (2 preceding siblings ...)
  2014-01-24  3:14         ` Wanpeng Li
@ 2014-01-24  3:14         ` Wanpeng Li
       [not found]         ` <52e1da8f.86f7440a.120f.25f3SMTPIN_ADDED_BROKEN@mx.google.com>
  4 siblings, 0 replies; 229+ messages in thread
From: Wanpeng Li @ 2014-01-24  3:14 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, benh, paulus, penberg, mpm, nacc, Anton Blanchard,
	linux-mm, linuxppc-dev, Han Pingtian, David Rientjes

On Fri, Jan 24, 2014 at 11:09:07AM +0800, Wanpeng Li wrote:
>Hi Christoph,
>On Mon, Jan 20, 2014 at 04:13:30PM -0600, Christoph Lameter wrote:
>>On Mon, 20 Jan 2014, Wanpeng Li wrote:
>>
>>> >+       enum zone_type high_zoneidx = gfp_zone(flags);
>>> >
>>> >+       if (!node_present_pages(searchnode)) {
>>> >+               zonelist = node_zonelist(searchnode, flags);
>>> >+               for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
>>> >+                       searchnode = zone_to_nid(zone);
>>> >+                       if (node_present_pages(searchnode))
>>> >+                               break;
>>> >+               }
>>> >+       }
>>> >        object = get_partial_node(s, get_node(s, searchnode), c, flags);
>>> >        if (object || node != NUMA_NO_NODE)
>>> >                return object;
>>> >
>>>
>>> The patch fixes the bug. However, the kernel crashed very quickly after running
>>> stress tests for a short while:
>>
>>This is not a good way of fixing it. How about not asking for memory from
>>nodes that are memoryless? Use numa_mem_id(), which gives you the next node
>>that has memory, instead of numa_node_id() (which gives you the current
>>node regardless of whether it has memory).
>
>diff --git a/mm/slub.c b/mm/slub.c
>index 545a170..a1c6040 100644
>--- a/mm/slub.c
>+++ b/mm/slub.c
>@@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> 	void *object;
>	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
>
>+	if (!node_present_pages(searchnode))
>+		searchnode = numa_mem_id();
>+
>	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>	if (object || node != NUMA_NO_NODE)
>		return object;
>

The bug is still not fixed with this patch. 

Regards,
Wanpeng Li 


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
       [not found]         ` <52e1da8f.86f7440a.120f.25f3SMTPIN_ADDED_BROKEN@mx.google.com>
@ 2014-01-24 15:50             ` Christoph Lameter
  0 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-01-24 15:50 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Joonsoo Kim, benh, paulus, penberg, mpm, nacc, Anton Blanchard,
	linux-mm, linuxppc-dev, Han Pingtian, David Rientjes

On Fri, 24 Jan 2014, Wanpeng Li wrote:

> >
> >diff --git a/mm/slub.c b/mm/slub.c
> >index 545a170..a1c6040 100644
> >--- a/mm/slub.c
> >+++ b/mm/slub.c
> >@@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> > 	void *object;
> >	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

This needs to be numa_mem_id(), and numa_mem_id() would need to be
used consistently.

> >
> >+	if (!node_present_pages(searchnode))
> >+		searchnode = numa_mem_id();

Probably won't need that?

> >+
> >	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >	if (object || node != NUMA_NO_NODE)
> >		return object;
> >
>
> The bug still can't be fixed w/ this patch.

Some more detail would be good. If memory is requested from a particular
node, then it would be best to use one that has memory. Callers may also
have used numa_node_id(), and those call sites would need to be fixed as well.
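
Something like this in get_partial(), perhaps (an untested sketch of the
above suggestion, not a patch from this thread):

	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

	object = get_partial_node(s, get_node(s, searchnode), c, flags);
	if (object || node != NUMA_NO_NODE)
		return object;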



^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-24 15:50             ` Christoph Lameter
@ 2014-01-24 21:03               ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-01-24 21:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Wanpeng Li, Joonsoo Kim, benh, paulus, penberg, mpm, nacc,
	Anton Blanchard, linux-mm, linuxppc-dev, Han Pingtian

On Fri, 24 Jan 2014, Christoph Lameter wrote:

> On Fri, 24 Jan 2014, Wanpeng Li wrote:
> 
> > >
> > >diff --git a/mm/slub.c b/mm/slub.c
> > >index 545a170..a1c6040 100644
> > >--- a/mm/slub.c
> > >+++ b/mm/slub.c
> > >@@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> > > 	void *object;
> > >	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> 
> This needs to be numa_mem_id() and numa_mem_id would need to be
> consistently used.
> 
> > >
> > >+	if (!node_present_pages(searchnode))
> > >+		searchnode = numa_mem_id();
> 
> Probably wont need that?
> 

I think the problem is a memoryless node being passed to kmalloc_node(), so
we need to decide where to enforce node_present_pages(). __slab_alloc()
seems like the best candidate when !node_match().


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-24 21:03               ` David Rientjes
@ 2014-01-24 22:19                 ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-01-24 22:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, Wanpeng Li, Joonsoo Kim, benh, paulus,
	penberg, mpm, Anton Blanchard, linux-mm, linuxppc-dev,
	Han Pingtian

On 24.01.2014 [13:03:13 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Christoph Lameter wrote:
> 
> > On Fri, 24 Jan 2014, Wanpeng Li wrote:
> > 
> > > >
> > > >diff --git a/mm/slub.c b/mm/slub.c
> > > >index 545a170..a1c6040 100644
> > > >--- a/mm/slub.c
> > > >+++ b/mm/slub.c
> > > >@@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> > > > 	void *object;
> > > >	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > 
> > This needs to be numa_mem_id() and numa_mem_id would need to be
> > consistently used.
> > 
> > > >
> > > >+	if (!node_present_pages(searchnode))
> > > >+		searchnode = numa_mem_id();
> > 
> > Probably wont need that?
> > 
> 
> I think the problem is a memoryless node being used for kmalloc_node() so 
> we need to decide where to enforce node_present_pages().  __slab_alloc() 
> seems like the best candidate when !node_match().
> 

Yep, I'm looking through callers and such right now and came to a
similar conclusion. I should have a patch soon.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-24 21:03               ` David Rientjes
@ 2014-01-24 23:29                 ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-01-24 23:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: Christoph Lameter, penberg, linux-mm, Han Pingtian, paulus,
	Anton Blanchard, mpm, Joonsoo Kim, linuxppc-dev, Wanpeng Li

On 24.01.2014 [13:03:13 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Christoph Lameter wrote:
> 
> > On Fri, 24 Jan 2014, Wanpeng Li wrote:
> > 
> > > >
> > > >diff --git a/mm/slub.c b/mm/slub.c
> > > >index 545a170..a1c6040 100644
> > > >--- a/mm/slub.c
> > > >+++ b/mm/slub.c
> > > >@@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> > > > 	void *object;
> > > >	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > 
> > This needs to be numa_mem_id() and numa_mem_id would need to be
> > consistently used.
> > 
> > > >
> > > >+	if (!node_present_pages(searchnode))
> > > >+		searchnode = numa_mem_id();
> > 
> > Probably wont need that?
> > 
> 
> I think the problem is a memoryless node being used for kmalloc_node() so 
> we need to decide where to enforce node_present_pages().  __slab_alloc() 
> seems like the best candidate when !node_match().

Actually, this is effectively what Anton's patch does, except with
Wanpeng's adjustment to use node_present_pages(). Does that seem
sufficient to you?

It only covers the memoryless-node case (not the exhausted-node
case), but I don't think that should block the fix (and it does fix the
issue we've run across in our testing).

-Nish


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-24 23:29                 ` Nishanth Aravamudan
@ 2014-01-24 23:49                   ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-01-24 23:49 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Lameter, penberg, linux-mm, Han Pingtian, paulus,
	Anton Blanchard, mpm, Joonsoo Kim, linuxppc-dev, Wanpeng Li

On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:

> > I think the problem is a memoryless node being used for kmalloc_node() so 
> > we need to decide where to enforce node_present_pages().  __slab_alloc() 
> > seems like the best candidate when !node_match().
> 
> Actually, this is effectively what Anton's patch does, except with
> Wanpeng's adjustment to use node_present_pages(). Does that seem
> sufficient to you?
> 

I don't see that as being the effect of Anton's patch.  As Christoph 
mentioned, we need to use numa_mem_id() when a memoryless node is passed, 
to get the best NUMA locality.  Something like this:

diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2278,10 +2278,14 @@ redo:
 
 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+		if (unlikely(!node_present_pages(node)))
+			node = numa_mem_id();
+		if (!node_match(page, node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}
 
 	/*

> It does only cover the memoryless node case (not the exhausted node
> case), but I think that shouldn't block the fix (and it does fix the
> issue we've run across in our testing).
> 

kmalloc_node(nid) and kmem_cache_alloc_node(nid) should fall back to nodes 
other than nid when memory can't be allocated; these functions only 
indicate a preference.


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-24 23:49                   ` David Rientjes
@ 2014-01-25  0:16                     ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-01-25  0:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Han Pingtian, penberg, linux-mm, paulus, Anton Blanchard, mpm,
	Christoph Lameter, linuxppc-dev, Joonsoo Kim, Wanpeng Li

On 24.01.2014 [15:49:33 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> 
> > > I think the problem is a memoryless node being used for kmalloc_node() so 
> > > we need to decide where to enforce node_present_pages().  __slab_alloc() 
> > > seems like the best candidate when !node_match().
> > 
> > Actually, this is effectively what Anton's patch does, except with
> > Wanpeng's adjustment to use node_present_pages(). Does that seem
> > sufficient to you?
> > 
> 
> I don't see that as being the effect of Anton's patch.  We need to use 
> numa_mem_id() as Christoph mentioned when a memoryless node is passed for 
> the best NUMA locality.  Something like this:

Thank you for clarifying and providing a test patch. I ran with this on
the system showing the original problem, configured to have 15GB of
memory.

With your patch after boot:

MemTotal:       15604736 kB
MemFree:         8768192 kB
Slab:            3882560 kB
SReclaimable:     105408 kB
SUnreclaim:      3777152 kB

With Anton's patch after boot:

MemTotal:       15604736 kB
MemFree:        11195008 kB
Slab:            1427968 kB
SReclaimable:     109184 kB
SUnreclaim:      1318784 kB


I know that's fairly unscientific, but the numbers are reproducible. 

For what it's worth, a sample of the unmodified numbers:

MemTotal:       15317632 kB
MemFree:         5023424 kB
Slab:            7176064 kB
SReclaimable:     106816 kB
SUnreclaim:      7069248 kB

So it's an improvement, but it seems something is still causing us to
use slab memory quite inefficiently.


> diff --git a/mm/slub.c b/mm/slub.c
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2278,10 +2278,14 @@ redo:
> 
>  	if (unlikely(!node_match(page, node))) {
>  		stat(s, ALLOC_NODE_MISMATCH);
> -		deactivate_slab(s, page, c->freelist);
> -		c->page = NULL;
> -		c->freelist = NULL;
> -		goto new_slab;
> +		if (unlikely(!node_present_pages(node)))
> +			node = numa_mem_id();
> +		if (!node_match(page, node)) {
> +			deactivate_slab(s, page, c->freelist);
> +			c->page = NULL;
> +			c->freelist = NULL;
> +			goto new_slab;
> +		}

Semantically, and please correct me if I'm wrong, this patch is saying
that if we have a memoryless node, we expect the page's locality to be
that of numa_mem_id(), and we still deactivate the slab if that isn't
true. Just want to make sure I understand the intent.

What I find odd is that there are only 2 nodes on this system, node 0
(empty) and node 1. So won't numa_mem_id() always be 1? And every page
should be coming from node 1 (thus node_match() should always be true?)
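
For reference, node_match() in mm/slub.c around this time looked roughly
like this (paraphrased from the 3.x source, so treat as approximate):

	static inline bool node_match(struct page *page, int node)
	{
	#ifdef CONFIG_NUMA
		/* a NUMA_NO_NODE request matches any page */
		if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
			return false;
	#endif
		return true;
	}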

Thanks,
Nish


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-25  0:16                     ` Nishanth Aravamudan
@ 2014-01-25  0:25                       ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-01-25  0:25 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, penberg, linux-mm, paulus, Anton Blanchard, mpm,
	Christoph Lameter, linuxppc-dev, Joonsoo Kim, Wanpeng Li

On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:

> Thank you for clarifying and providing  a test patch. I ran with this on
> the system showing the original problem, configured to have 15GB of
> memory.
> 
> With your patch after boot:
> 
> MemTotal:       15604736 kB
> MemFree:         8768192 kB
> Slab:            3882560 kB
> SReclaimable:     105408 kB
> SUnreclaim:      3777152 kB
> 
> With Anton's patch after boot:
> 
> MemTotal:       15604736 kB
> MemFree:        11195008 kB
> Slab:            1427968 kB
> SReclaimable:     109184 kB
> SUnreclaim:      1318784 kB
> 
> 
> I know that's fairly unscientific, but the numbers are reproducible. 
> 

I don't think the goal of the discussion is to reduce the amount of slab 
memory allocated, but rather to get the most local slab memory possible 
by use of kmalloc_node().  When a memoryless node is passed to 
kmalloc_node(), which is probably cpu_to_node() for a cpu bound to a node 
without memory, my patch allocates on the most local node; Anton's patch 
allocates on whatever happened to be the cpu slab.

> > diff --git a/mm/slub.c b/mm/slub.c
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2278,10 +2278,14 @@ redo:
> > 
> >  	if (unlikely(!node_match(page, node))) {
> >  		stat(s, ALLOC_NODE_MISMATCH);
> > -		deactivate_slab(s, page, c->freelist);
> > -		c->page = NULL;
> > -		c->freelist = NULL;
> > -		goto new_slab;
> > +		if (unlikely(!node_present_pages(node)))
> > +			node = numa_mem_id();
> > +		if (!node_match(page, node)) {
> > +			deactivate_slab(s, page, c->freelist);
> > +			c->page = NULL;
> > +			c->freelist = NULL;
> > +			goto new_slab;
> > +		}
> 
> Semantically, and please correct me if I'm wrong, this patch is saying
> if we have a memoryless node, we expect the page's locality to be that
> of numa_mem_id(), and we still deactivate the slab if that isn't true.
> Just wanting to make sure I understand the intent.
> 

Yeah, the default policy should be to fall back to local memory if the node 
passed is memoryless.

> What I find odd is that there are only 2 nodes on this system, node 0
> (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> should be coming from node 1 (thus node_match() should always be true?)
> 

The nice thing about slub is its debugging ability; what is 
/sys/kernel/slab/cache/objects showing in comparison between the two 
patches?


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-25  0:25                       ` David Rientjes
@ 2014-01-25  1:10                         ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-01-25  1:10 UTC (permalink / raw)
  To: David Rientjes
  Cc: Han Pingtian, penberg, linux-mm, paulus, Anton Blanchard, mpm,
	Christoph Lameter, linuxppc-dev, Joonsoo Kim, Wanpeng Li

On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> 
> > Thank you for clarifying and providing  a test patch. I ran with this on
> > the system showing the original problem, configured to have 15GB of
> > memory.
> > 
> > With your patch after boot:
> > 
> > MemTotal:       15604736 kB
> > MemFree:         8768192 kB
> > Slab:            3882560 kB
> > SReclaimable:     105408 kB
> > SUnreclaim:      3777152 kB
> > 
> > With Anton's patch after boot:
> > 
> > MemTotal:       15604736 kB
> > MemFree:        11195008 kB
> > Slab:            1427968 kB
> > SReclaimable:     109184 kB
> > SUnreclaim:      1318784 kB
> > 
> > 
> > I know that's fairly unscientific, but the numbers are reproducible. 
> > 
> 
> I don't think the goal of the discussion is to reduce the amount of slab 
> allocated, but rather get the most local slab memory possible by use of 
> kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> which is probably cpu_to_node() for a cpu bound to a node without memory, 
> my patch is allocating it on the most local node; Anton's patch is 
> allocating it from whichever node the cpu slab happened to come from.

Well, the issue we're trying to resolve, based upon our analysis, is
that we're seeing incredibly inefficient slab usage with memoryless
nodes, to the point where we are OOM'ing an 8GB system without doing
anything particularly stressful.

As to cpu_to_node() being passed to kmalloc_node(), I think an
appropriate fix is to change that to cpu_to_mem()?
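
Something like this, as a sketch (buf, size and cpu are placeholders, not
from a specific caller; cpu_to_mem() maps a cpu to its nearest node that
actually has memory when CONFIG_HAVE_MEMORYLESS_NODES is enabled):

	/* before: may name a memoryless node on these ppc64 boxes */
	buf = kmalloc_node(size, GFP_KERNEL, cpu_to_node(cpu));

	/* after: names the nearest node that actually has memory */
	buf = kmalloc_node(size, GFP_KERNEL, cpu_to_mem(cpu));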

> > > diff --git a/mm/slub.c b/mm/slub.c
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -2278,10 +2278,14 @@ redo:
> > > 
> > >  	if (unlikely(!node_match(page, node))) {
> > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > -		deactivate_slab(s, page, c->freelist);
> > > -		c->page = NULL;
> > > -		c->freelist = NULL;
> > > -		goto new_slab;
> > > +		if (unlikely(!node_present_pages(node)))
> > > +			node = numa_mem_id();
> > > +		if (!node_match(page, node)) {
> > > +			deactivate_slab(s, page, c->freelist);
> > > +			c->page = NULL;
> > > +			c->freelist = NULL;
> > > +			goto new_slab;
> > > +		}
> > 
> > Semantically, and please correct me if I'm wrong, this patch is saying
> > if we have a memoryless node, we expect the page's locality to be that
> > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > Just wanting to make sure I understand the intent.
> > 
> 
> Yeah, the default policy should be to fallback to local memory if the node 
> passed is memoryless.

Thanks!

> > What I find odd is that there are only 2 nodes on this system, node 0
> > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > should be coming from node 1 (thus node_match() should always be true?)
> > 
> 
> The nice thing about slub is its debugging ability. What is
> /sys/kernel/slab/<cache>/objects showing in comparison between the two
> patches?

Do you mean kmem_cache or kmem_cache_node?

-Nish


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-25  1:10                         ` Nishanth Aravamudan
@ 2014-01-27  5:58                           ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-01-27  5:58 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li

On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > 
> > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > the system showing the original problem, configured to have 15GB of
> > > memory.
> > > 
> > > With your patch after boot:
> > > 
> > > MemTotal:       15604736 kB
> > > MemFree:         8768192 kB
> > > Slab:            3882560 kB
> > > SReclaimable:     105408 kB
> > > SUnreclaim:      3777152 kB
> > > 
> > > With Anton's patch after boot:
> > > 
> > > MemTotal:       15604736 kB
> > > MemFree:        11195008 kB
> > > Slab:            1427968 kB
> > > SReclaimable:     109184 kB
> > > SUnreclaim:      1318784 kB
> > > 
> > > 
> > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > 

Hello,

I think that there is one mistake in David's patch, although I'm not sure
that it is the reason for this result.

With David's patch, get_partial() in new_slab_objects() doesn't work properly,
because we only change the node id in the !node_match() case. If we hit just the
!freelist case, we pass the node id directly to new_slab_objects(), so we always
try to allocate a new slab page regardless of the existence of partial pages. We
should solve that.

Could you try this one?

Thanks.

--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
                struct kmem_cache_cpu *c)
 {
        void *object;
-       int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+       int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
+       if (node != NUMA_NO_NODE && !node_present_pages(node))
+               searchnode = numa_mem_id();
        object = get_partial_node(s, get_node(s, searchnode), c, flags);
        if (object || node != NUMA_NO_NODE)
                return object;
@@ -2278,10 +2280,14 @@ redo:
 
        if (unlikely(!node_match(page, node))) {
                stat(s, ALLOC_NODE_MISMATCH);
-               deactivate_slab(s, page, c->freelist);
-               c->page = NULL;
-               c->freelist = NULL;
-               goto new_slab;
+               if (unlikely(!node_present_pages(node)))
+                       node = numa_mem_id();
+               if (!node_match(page, node)) {
+                       deactivate_slab(s, page, c->freelist);
+                       c->page = NULL;
+                       c->freelist = NULL;
+                       goto new_slab;
+               }
        }
 
        /*


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-24 23:49                   ` David Rientjes
@ 2014-01-27 16:16                     ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-01-27 16:16 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nishanth Aravamudan, penberg, linux-mm, Han Pingtian, paulus,
	Anton Blanchard, mpm, Joonsoo Kim, linuxppc-dev, Wanpeng Li

On Fri, 24 Jan 2014, David Rientjes wrote:

> kmalloc_node(nid) and kmem_cache_alloc_node(nid) should fallback to nodes
> other than nid when memory can't be allocated, these functions only
> indicate a preference.

The nid passed indicates a preference unless __GFP_THISNODE is specified.
Then the allocation must occur on that node.
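
For example (a sketch of the caller-side difference; size and nid are
placeholders):

	/* a preference: may fall back to other nodes under pressure */
	p = kmalloc_node(size, GFP_KERNEL, nid);

	/* a hard requirement: fail rather than fall back */
	p = kmalloc_node(size, GFP_KERNEL | __GFP_THISNODE, nid);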



^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-25  1:10                         ` Nishanth Aravamudan
@ 2014-01-27 16:18                           ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-01-27 16:18 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, linuxppc-dev, Joonsoo Kim, Wanpeng Li

On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:

> As to cpu_to_node() being passed to kmalloc_node(), I think an
> appropriate fix is to change that to cpu_to_mem()?

Yup.

> > Yeah, the default policy should be to fallback to local memory if the node
> > passed is memoryless.
>
> Thanks!

I would suggest using NUMA_NO_NODE instead. That will fit any slab that
we may currently be allocating from or can get hold of, and is mostly
efficient.
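
That is, something like (a sketch; size is a placeholder):

	/* no node preference: take whatever the current cpu slab offers */
	p = kmalloc_node(size, GFP_KERNEL, NUMA_NO_NODE);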


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-25  0:16                     ` Nishanth Aravamudan
@ 2014-01-27 16:24                       ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-01-27 16:24 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, linuxppc-dev, Joonsoo Kim, Wanpeng Li

On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:

> What I find odd is that there are only 2 nodes on this system, node 0
> (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> should be coming from node 1 (thus node_match() should always be true?)

Well, yes, that occurs if you specify the node or just always use the
default memory allocation policy.

In order to spread the allocations over both nodes you would have to set the
task's memory allocation policy to MPOL_INTERLEAVE.
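
From userspace that is roughly (a self-contained sketch, link with -lnuma;
nodes 0 and 1 assumed, as on the box discussed here):

	#include <numaif.h>
	#include <stdio.h>

	int main(void)
	{
		/* interleave this task's allocations across nodes 0 and 1 */
		unsigned long nodemask = (1UL << 0) | (1UL << 1);

		if (set_mempolicy(MPOL_INTERLEAVE, &nodemask,
				  8 * sizeof(nodemask)) != 0)
			perror("set_mempolicy");
		return 0;
	}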


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-27  5:58                           ` Joonsoo Kim
@ 2014-01-28 18:29                             ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-01-28 18:29 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li

On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > 
> > > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > > the system showing the original problem, configured to have 15GB of
> > > > memory.
> > > > 
> > > > With your patch after boot:
> > > > 
> > > > MemTotal:       15604736 kB
> > > > MemFree:         8768192 kB
> > > > Slab:            3882560 kB
> > > > SReclaimable:     105408 kB
> > > > SUnreclaim:      3777152 kB
> > > > 
> > > > With Anton's patch after boot:
> > > > 
> > > > MemTotal:       15604736 kB
> > > > MemFree:        11195008 kB
> > > > Slab:            1427968 kB
> > > > SReclaimable:     109184 kB
> > > > SUnreclaim:      1318784 kB
> > > > 
> > > > 
> > > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > > 
> 
> Hello,
> 
> I think that there is one mistake in David's patch, although I'm not sure
> that it is the reason for this result.
> 
> With David's patch, get_partial() in new_slab_objects() doesn't work
> properly, because we only change the node id in the !node_match() case. If
> we hit just the !freelist case, we pass the node id directly to
> new_slab_objects(), so we always try to allocate a new slab page
> regardless of the existence of partial pages. We should solve that.
> 
> Could you try this one?

This helps about the same as David's patch -- but I found the reason
why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
shortly for that and one other case I found.
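
(For context: without CONFIG_HAVE_MEMORYLESS_NODES, numa_mem_id() and
cpu_to_mem() simply collapse to numa_node_id() and cpu_to_node(), so all of
the memoryless-node fallbacks above are no-ops. Roughly, paraphrasing
include/linux/topology.h:)

	#ifdef CONFIG_HAVE_MEMORYLESS_NODES
	/* _numa_mem_ is a per-cpu cache of the nearest node with memory */
	#define numa_mem_id()	__this_cpu_read(_numa_mem_)
	#else
	#define numa_mem_id()	numa_node_id()
	#endif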

This patch on its own seems to help on our test system by saving around
1.5GB of slab.

Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

with the caveat below.

Thanks,
Nish

> 
> Thanks.
> 
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>                 struct kmem_cache_cpu *c)
>  {
>         void *object;
> -       int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +       int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
> +       if (node != NUMA_NO_NODE && !node_present_pages(node))
> +               searchnode = numa_mem_id();

This might be clearer as:

int searchnode = node;
if (node == NUMA_NO_NODE || !node_present_pages(node))
	searchnode = numa_mem_id();

>         object = get_partial_node(s, get_node(s, searchnode), c, flags);
>         if (object || node != NUMA_NO_NODE)
>                 return object;
> @@ -2278,10 +2280,14 @@ redo:
> 
>         if (unlikely(!node_match(page, node))) {
>                 stat(s, ALLOC_NODE_MISMATCH);
> -               deactivate_slab(s, page, c->freelist);
> -               c->page = NULL;
> -               c->freelist = NULL;
> -               goto new_slab;
> +               if (unlikely(!node_present_pages(node)))
> +                       node = numa_mem_id();
> +               if (!node_match(page, node)) {
> +                       deactivate_slab(s, page, c->freelist);
> +                       c->page = NULL;
> +                       c->freelist = NULL;
> +                       goto new_slab;
> +               }
>         }
> 
>         /*
> 


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-28 18:29                             ` Nishanth Aravamudan
@ 2014-01-29 15:54                               ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-01-29 15:54 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, penberg, linux-mm,
	paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li

On Tue, 28 Jan 2014, Nishanth Aravamudan wrote:

> This helps about the same as David's patch -- but I found the reason
> why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
> shortly for that and one other case I found.

Oww...


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-28 18:29                             ` Nishanth Aravamudan
@ 2014-01-29 22:36                               ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-01-29 22:36 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li, cody

On 28.01.2014 [10:29:47 -0800], Nishanth Aravamudan wrote:
> On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> > On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > 
> > > > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > > > the system showing the original problem, configured to have 15GB of
> > > > > memory.
> > > > > 
> > > > > With your patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:         8768192 kB
> > > > > Slab:            3882560 kB
> > > > > SReclaimable:     105408 kB
> > > > > SUnreclaim:      3777152 kB
> > > > > 
> > > > > With Anton's patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:        11195008 kB
> > > > > Slab:            1427968 kB
> > > > > SReclaimable:     109184 kB
> > > > > SUnreclaim:      1318784 kB
> > > > > 
> > > > > 
> > > > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > > > 
> > 
> > Hello,
> > 
> > I think that there is one mistake in David's patch, although I'm not sure
> > that it is the reason for this result.
> > 
> > With David's patch, get_partial() in new_slab_objects() doesn't work
> > properly, because we only change the node id in the !node_match() case. If
> > we hit just the !freelist case, we pass the node id directly to
> > new_slab_objects(), so we always try to allocate a new slab page
> > regardless of the existence of partial pages. We should solve that.
> > 
> > Could you try this one?
> 
> This helps about the same as David's patch -- but I found the reason
> why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
> shortly for that and one other case I found.
> 
> This patch on its own seems to help on our test system by saving around
> 1.5GB of slab.
> 
> Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> 
> with the caveat below.
> 
> Thanks,
> Nish
> 
> > 
> > Thanks.
> > 
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1698,8 +1698,10 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >                 struct kmem_cache_cpu *c)
> >  {
> >         void *object;
> > -       int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > +       int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> > 
> > +       if (node != NUMA_NO_NODE && !node_present_pages(node))
> > +               searchnode = numa_mem_id();
> 
> This might be clearer as:
> 
> int searchnode = node;
> if (node == NUMA_NO_NODE || !node_present_pages(node))
> 	searchnode = numa_mem_id();

Cody Schafer mentioned to me on IRC that this may not always reflect
exactly what the caller intends.

int searchnode = node;
if (node == NUMA_NO_NODE)
	searchnode = numa_mem_id();
else if (!node_present_pages(node))	/* don't probe NUMA_NO_NODE itself */
	searchnode = local_memory_node(node);

The difference in semantics from the previous is that here, if we have a
memoryless node, rather than using the CPU's nearest NUMA node, we use
the NUMA node closest to the requested one?

> >         object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >         if (object || node != NUMA_NO_NODE)
> >                 return object;
> > @@ -2278,10 +2280,14 @@ redo:
> > 
> >         if (unlikely(!node_match(page, node))) {
> >                 stat(s, ALLOC_NODE_MISMATCH);
> > -               deactivate_slab(s, page, c->freelist);
> > -               c->page = NULL;
> > -               c->freelist = NULL;
> > -               goto new_slab;
> > +               if (unlikely(!node_present_pages(node)))
> > +                       node = numa_mem_id();

Similarly here?

-Nish

> > +               if (!node_match(page, node)) {
> > +                       deactivate_slab(s, page, c->freelist);
> > +                       c->page = NULL;
> > +                       c->freelist = NULL;
> > +                       goto new_slab;
> > +               }
> >         }
> > 
> >         /*
> > 


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-29 22:36                               ` Nishanth Aravamudan
@ 2014-01-30 16:26                                 ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-01-30 16:26 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, penberg, linux-mm,
	paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li, cody

On Wed, 29 Jan 2014, Nishanth Aravamudan wrote:

> exactly what the caller intends.
>
> int searchnode = node;
> if (node == NUMA_NO_NODE)
> 	searchnode = numa_mem_id();
> else if (!node_present_pages(node))
> 	searchnode = local_memory_node(node);
>
> The difference in semantics from the previous is that here, if we have a
> memoryless node, rather than using the CPU's nearest NUMA node, we use
> the NUMA node closest to the requested one?

The idea here is that the page allocator will do the fallback to other
nodes. This check for !node_present should not be necessary. SLUB needs to
accept the page from whatever node the page allocator returned and work
with that.

The problem is that the check for having a slab from the "right" node may
fail again after another attempt to allocate from the same node. SLUB will
then push the slab from the *wrong* node back to the partial lists and may
attempt another allocation that will again be successful but return memory
from another node. That way the partial lists for a particular node keep
growing uselessly.

One way to solve this may be to check whether memory was actually allocated
from the requested node, and fall back to NUMA_NO_NODE (which will use the
last allocated slab) for future allocs if the page allocator returned memory
from a different node (unless __GFP_THISNODE is set, of course). Otherwise we
end up replicating the page allocator logic in slub like in slab, which is
what I wanted to avoid.
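
A rough sketch of that idea around the new_slab() call (untested, just to
make the suggestion concrete; names as in mm/slub.c):

	page = new_slab(s, flags, node);
	if (page && node != NUMA_NO_NODE && page_to_nid(page) != node) {
		/*
		 * The page allocator fell back to another node, so stop
		 * insisting on the requested node and take whatever the
		 * cpu slab gives us from here on.
		 */
		node = NUMA_NO_NODE;
	}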


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-28 18:29                             ` Nishanth Aravamudan
@ 2014-02-03 23:00                               ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-03 23:00 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li

On 28.01.2014 [10:29:47 -0800], Nishanth Aravamudan wrote:
> On 27.01.2014 [14:58:05 +0900], Joonsoo Kim wrote:
> > On Fri, Jan 24, 2014 at 05:10:42PM -0800, Nishanth Aravamudan wrote:
> > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > 
> > > > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > > > the system showing the original problem, configured to have 15GB of
> > > > > memory.
> > > > > 
> > > > > With your patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:         8768192 kB
> > > > > Slab:            3882560 kB
> > > > > SReclaimable:     105408 kB
> > > > > SUnreclaim:      3777152 kB
> > > > > 
> > > > > With Anton's patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:        11195008 kB
> > > > > Slab:            1427968 kB
> > > > > SReclaimable:     109184 kB
> > > > > SUnreclaim:      1318784 kB
> > > > > 
> > > > > 
> > > > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > > > 
> > 
> > Hello,
> > 
> > I think that there is one mistake in David's patch, although I'm not sure
> > that it is the reason for this result.
> > 
> > With David's patch, get_partial() in new_slab_objects() doesn't work
> > properly, because we only change the node id in the !node_match() case. If
> > we hit just the !freelist case, we pass the node id directly to
> > new_slab_objects(), so we always try to allocate a new slab page
> > regardless of the existence of partial pages. We should solve that.
> > 
> > Could you try this one?
> 
> This helps about the same as David's patch -- but I found the reason
> why! ppc64 doesn't set CONFIG_HAVE_MEMORYLESS_NODES :) Expect a patch
> shortly for that and one other case I found.
> 
> This patch on its own seems to help on our test system by saving around
> 1.5GB of slab.
> 
> Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> 
> with the caveat below.

So what's the status of this patch? Christoph, do you think this is fine
as it is?

Thanks,
Nish


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-02-03 23:00                               ` Nishanth Aravamudan
@ 2014-02-04  3:38                                 ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-04  3:38 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, penberg, linux-mm,
	paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li

On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

> So what's the status of this patch? Christoph, do you think this is fine
> as it is?

Certainly enabling CONFIG_HAVE_MEMORYLESS_NODES is the right thing to do and I
already acked the patch.


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-02-04  3:38                                 ` Christoph Lameter
@ 2014-02-04  7:26                                   ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-04  7:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, penberg, linux-mm,
	paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li

On 03.02.2014 [21:38:36 -0600], Christoph Lameter wrote:
> On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:
> 
> > So what's the status of this patch? Christoph, do you think this is fine
> > as it is?
> 
> Certainly enabling CONFIG_HAVE_MEMORYLESS_NODES is the right thing to do and I
> already acked the patch.

Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
the $SUBJECT issue.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-02-04  7:26                                   ` Nishanth Aravamudan
@ 2014-02-04 20:39                                     ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-04 20:39 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, penberg, linux-mm,
	paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li

On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:

> Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
> the $SUBJECT issue.

Hmmm... I am not sure that this is a general solution. The fallback to
other nodes can occur for reasons other than a node having no memory,
which is what his patch assumes.

If the target node allocation fails (for whatever reason) then I would
recommend, for simplicity's sake, changing the target node to NUMA_NO_NODE
and just taking whatever is in the current cpu slab. A more complex solution
would be to look through partial lists in increasing distance to find a
partially used slab that is reasonably close to the current node. Slab has
logic like that in fallback_alloc(). Slub's get_any_partial() function does
something close to what you want.
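
For reference, the increasing-distance walk is roughly this (simplified
from slub's get_any_partial(); cpuset and defrag_ratio checks elided, so
not a drop-in patch):

	struct zonelist *zonelist = node_zonelist(numa_mem_id(), flags);
	struct zoneref *z;
	struct zone *zone;
	void *object;

	for_each_zone_zonelist(zone, z, zonelist, gfp_zone(flags)) {
		struct kmem_cache_node *n = get_node(s, zone_to_nid(zone));

		/* take the first node, nearest first, with partial slabs */
		if (n && n->nr_partial > s->min_partial) {
			object = get_partial_node(s, n, c, flags);
			if (object)
				return object;
		}
	}
	return NULL;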



^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-02-04 20:39                                     ` Christoph Lameter
@ 2014-02-05  0:13                                       ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-05  0:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li

On 04.02.2014 [14:39:32 -0600], Christoph Lameter wrote:
> On Mon, 3 Feb 2014, Nishanth Aravamudan wrote:
> 
> > Yes, sorry for my lack of clarity. I meant Joonsoo's latest patch for
> > the $SUBJECT issue.
> 
> Hmmm... I am not sure that this is a general solution. The fallback to
> other nodes can occur not only because a node has no memory, as his patch
> assumes.

Thanks, Christoph. I see your point.

Something in this area would be nice, though, as it does produce a
fairly significant bump in the slab usage on our test system.

> If the target node allocation fails (for whatever reason) then I would
> recommend for simplicity's sake to change the target node to
> NUMA_NO_NODE and just take whatever is in the current cpu slab. A more
> complex solution would be to look through partial lists in increasing
> distance to find a partially used slab that is reasonably close to the
> current node. Slab has logic like that in fallback_alloc(). Slub's
> get_any_partial() function does something close to what you want.

I apologize for my own ignorance, but I'm having trouble following.
Anton's original patch did fall back to the current cpu slab, but I'm not
sure any NUMA_NO_NODE change is necessary there. At the point where we're
deactivating the slab (in the current code, in __slab_alloc()), we have
successfully allocated from somewhere; it's just not on the node we
expected to be on.

So perhaps you are saying to make a change lower in the code? I'm not
sure where it makes sense to change the target node in that case. I'd
appreciate any guidance you can give.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-02-05  0:13                                       ` Nishanth Aravamudan
@ 2014-02-05 19:28                                         ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-05 19:28 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li

On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:

> > If the target node allocation fails (for whatever reason) then I would
> > recommend for simplicity's sake to change the target node to
> > NUMA_NO_NODE and just take whatever is in the current cpu slab. A more
> > complex solution would be to look through partial lists in increasing
> > distance to find a partially used slab that is reasonably close to the
> > current node. Slab has logic like that in fallback_alloc(). Slub's
> > get_any_partial() function does something close to what you want.
>
> I apologize for my own ignorance, but I'm having trouble following.
> Anton's original patch did fall back to the current cpu slab, but I'm not
> sure any NUMA_NO_NODE change is necessary there. At the point where we're
> deactivating the slab (in the current code, in __slab_alloc()), we have
> successfully allocated from somewhere; it's just not on the node we
> expected to be on.

Right so if we are ignoring the node then the simplest thing to do is to
not deactivate the current cpu slab but to take an object from it.

> So perhaps you are saying to make a change lower in the code? I'm not
> sure where it makes sense to change the target node in that case. I'd
> appreciate any guidance you can give.

This is not an easy thing to do. If the current slab is not on the requested
node but is on the node from which the page allocator would return memory,
then the current slab can still be allocated from. If the fallback
is to another node then the current cpu slab needs to be deactivated and
the allocation from that node needs to proceed. Have a look at
fallback_alloc() in the slab allocator.

An allocation attempt from the page allocator can be restricted to a
specific node through __GFP_THISNODE.
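
For reference, such a node-restricted attempt looks roughly like this
(sketch; gfp flag plumbing and error handling omitted):

	/*
	 * __GFP_THISNODE forbids fallback: the call either returns
	 * pages on 'node' or fails, rather than silently handing
	 * back remote memory.
	 */
	page = alloc_pages_node(node, flags | __GFP_THISNODE, order);
	if (!page) {
		/* node is memoryless or exhausted; fall back explicitly */
	}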


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-01-25  0:25                       ` David Rientjes
@ 2014-02-06  2:07                         ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-06  2:07 UTC (permalink / raw)
  To: David Rientjes
  Cc: Han Pingtian, penberg, linux-mm, paulus, Anton Blanchard, mpm,
	Christoph Lameter, linuxppc-dev, Joonsoo Kim, Wanpeng Li

On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> 
> > Thank you for clarifying and providing  a test patch. I ran with this on
> > the system showing the original problem, configured to have 15GB of
> > memory.
> > 
> > With your patch after boot:
> > 
> > MemTotal:       15604736 kB
> > MemFree:         8768192 kB
> > Slab:            3882560 kB
> > SReclaimable:     105408 kB
> > SUnreclaim:      3777152 kB
> > 
> > With Anton's patch after boot:
> > 
> > MemTotal:       15604736 kB
> > MemFree:        11195008 kB
> > Slab:            1427968 kB
> > SReclaimable:     109184 kB
> > SUnreclaim:      1318784 kB
> > 
> > 
> > I know that's fairly unscientific, but the numbers are reproducible. 
> > 
> 
> I don't think the goal of the discussion is to reduce the amount of slab 
> allocated, but rather get the most local slab memory possible by use of 
> kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> which is probably cpu_to_node() for a cpu bound to a node without memory, 
> my patch is allocating it on the most local node; Anton's patch is 
> allocating it on whatever happened to be the cpu slab.
> 
> > > diff --git a/mm/slub.c b/mm/slub.c
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -2278,10 +2278,14 @@ redo:
> > > 
> > >  	if (unlikely(!node_match(page, node))) {
> > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > -		deactivate_slab(s, page, c->freelist);
> > > -		c->page = NULL;
> > > -		c->freelist = NULL;
> > > -		goto new_slab;
> > > +		if (unlikely(!node_present_pages(node)))
> > > +			node = numa_mem_id();
> > > +		if (!node_match(page, node)) {
> > > +			deactivate_slab(s, page, c->freelist);
> > > +			c->page = NULL;
> > > +			c->freelist = NULL;
> > > +			goto new_slab;
> > > +		}
> > 
> > Semantically, and please correct me if I'm wrong, this patch is saying
> > if we have a memoryless node, we expect the page's locality to be that
> > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > Just wanting to make sure I understand the intent.
> > 
> 
> Yeah, the default policy should be to fallback to local memory if the node 
> passed is memoryless.
> 
> > What I find odd is that there are only 2 nodes on this system, node 0
> > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > should be coming from node 1 (thus node_match() should always be true?)
> > 
> 
> The nice thing about slub is its debugging ability, what is 
> /sys/kernel/slab/cache/objects showing in comparison between the two 
> patches?

Ok, I finally got around to writing a script that compares the objects
output from both kernels.

log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
and Joonsoo's patch.

log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
and Anton's patch.

slab                           objects    objects   percent
                               log1       log2      change
-----------------------------------------------------------
:t-0000104                     71190      85680      20.353982 %
UDP                            4352       3392       22.058824 %
inode_cache                    54302      41923      22.796582 %
fscache_cookie_jar             3276       2457       25.000000 %
:t-0000896                     438        292        33.333333 %
:t-0000080                     310401     195323     37.073978 %
ext4_inode_cache               335        201        40.000000 %
:t-0000192                     89408      128898     44.168307 %
:t-0000184                     151300     81880      45.882353 %
:t-0000512                     49698      73648      48.191074 %
:at-0000192                    242867     120948     50.199904 %
xfs_inode                      34350      15221      55.688501 %
:t-0016384                     11005      17257      56.810541 %
proc_inode_cache               103868     34717      66.575846 %
tw_sock_TCP                    768        256        66.666667 %
:t-0004096                     15240      25672      68.451444 %
nfs_inode_cache                1008       315        68.750000 %
:t-0001024                     14528      24720      70.154185 %
:t-0032768                     655        1312       100.305344%
:t-0002048                     14242      30720      115.700042%
:t-0000640                     1020       2550       150.000000%
:t-0008192                     10005      27905      178.910545%

FWIW, the configuration of this LPAR has changed slightly. It is now configured
for a maximum of 400 CPUs, of which 200 are present. The result is that even with
Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
script reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-512                        1182 MB    2.03%  100.00%
kmalloc-192                        1182 MB    1.38%  100.00%
kmalloc-16384                       966 MB   17.66%  100.00%
kmalloc-4096                        353 MB   15.92%  100.00%
kmalloc-8192                        259 MB   27.28%  100.00%
kmalloc-32768                       207 MB    9.86%  100.00%

In comparison (log2 above):

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       273 MB   98.76%  100.00%
kmalloc-8192                        225 MB   98.67%  100.00%
pgtable-2^11                        114 MB  100.00%  100.00%
pgtable-2^12                        109 MB  100.00%  100.00%
kmalloc-4096                        104 MB   98.59%  100.00%

I appreciate all the help so far, if anyone has any ideas how best to
proceed further, or what they'd like debugged more, I'm happy to get
this fixed. We're hitting this on a couple of different systems and I'd
like to find a good resolution to the problem.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-02-05 19:28                                         ` Christoph Lameter
@ 2014-02-06  2:08                                           ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-06  2:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li

On 05.02.2014 [13:28:03 -0600], Christoph Lameter wrote:
> On Tue, 4 Feb 2014, Nishanth Aravamudan wrote:
> 
> > > If the target node allocation fails (for whatever reason) then I would
> > > recommend for simplicity's sake to change the target node to
> > > NUMA_NO_NODE and just take whatever is in the current cpu slab. A more
> > > complex solution would be to look through partial lists in increasing
> > > distance to find a partially used slab that is reasonably close to the
> > > current node. Slab has logic like that in fallback_alloc(). Slub's
> > > get_any_partial() function does something close to what you want.
> >
> > I apologize for my own ignorance, but I'm having trouble following.
> > Anton's original patch did fall back to the current cpu slab, but I'm not
> > sure any NUMA_NO_NODE change is necessary there. At the point where we're
> > deactivating the slab (in the current code, in __slab_alloc()), we have
> > successfully allocated from somewhere; it's just not on the node we
> > expected to be on.
> 
> Right so if we are ignoring the node then the simplest thing to do is to
> not deactivate the current cpu slab but to take an object from it.

Ok, that's what Anton's patch does, I believe. Are you ok with that
patch as it is?

> > So perhaps you are saying to make a change lower in the code? I'm not
> > sure where it makes sense to change the target node in that case. I'd
> > appreciate any guidance you can give.
> 
> This is not an easy thing to do. If the current slab is not on the requested
> node but is on the node from which the page allocator would return memory,
> then the current slab can still be allocated from. If the fallback
> is to another node then the current cpu slab needs to be deactivated and
> the allocation from that node needs to proceed. Have a look at
> fallback_alloc() in the slab allocator.
> 
> An allocation attempt from the page allocator can be restricted to a
> specific node through __GFP_THISNODE.

Thanks for the pointers, I will try and take a look.

Thanks,
Nish


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-02-06  2:07                         ` Nishanth Aravamudan
@ 2014-02-06  8:04                           ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-06  8:04 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li

On Wed, Feb 05, 2014 at 06:07:57PM -0800, Nishanth Aravamudan wrote:
> On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > 
> > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > the system showing the original problem, configured to have 15GB of
> > > memory.
> > > 
> > > With your patch after boot:
> > > 
> > > MemTotal:       15604736 kB
> > > MemFree:         8768192 kB
> > > Slab:            3882560 kB
> > > SReclaimable:     105408 kB
> > > SUnreclaim:      3777152 kB
> > > 
> > > With Anton's patch after boot:
> > > 
> > > MemTotal:       15604736 kB
> > > MemFree:        11195008 kB
> > > Slab:            1427968 kB
> > > SReclaimable:     109184 kB
> > > SUnreclaim:      1318784 kB
> > > 
> > > 
> > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > 
> > 
> > I don't think the goal of the discussion is to reduce the amount of slab 
> > allocated, but rather get the most local slab memory possible by use of 
> > kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> > which is probably cpu_to_node() for a cpu bound to a node without memory, 
> > my patch is allocating it on the most local node; Anton's patch is 
> > allocating it on whatever happened to be the cpu slab.
> > 
> > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > --- a/mm/slub.c
> > > > +++ b/mm/slub.c
> > > > @@ -2278,10 +2278,14 @@ redo:
> > > > 
> > > >  	if (unlikely(!node_match(page, node))) {
> > > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > > -		deactivate_slab(s, page, c->freelist);
> > > > -		c->page = NULL;
> > > > -		c->freelist = NULL;
> > > > -		goto new_slab;
> > > > +		if (unlikely(!node_present_pages(node)))
> > > > +			node = numa_mem_id();
> > > > +		if (!node_match(page, node)) {
> > > > +			deactivate_slab(s, page, c->freelist);
> > > > +			c->page = NULL;
> > > > +			c->freelist = NULL;
> > > > +			goto new_slab;
> > > > +		}
> > > 
> > > Semantically, and please correct me if I'm wrong, this patch is saying
> > > if we have a memoryless node, we expect the page's locality to be that
> > > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > > Just wanting to make sure I understand the intent.
> > > 
> > 
> > Yeah, the default policy should be to fallback to local memory if the node 
> > passed is memoryless.
> > 
> > > What I find odd is that there are only 2 nodes on this system, node 0
> > > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > > should be coming from node 1 (thus node_match() should always be true?)
> > > 
> > 
> > The nice thing about slub is its debugging ability, what is 
> > /sys/kernel/slab/cache/objects showing in comparison between the two 
> > patches?
> 
> Ok, I finally got around to writing a script that compares the objects
> output from both kernels.
> 
> log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> and Joonsoo's patch.
> 
> log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> and Anton's patch.
> 
> slab                           objects    objects   percent
>                                log1       log2      change
> -----------------------------------------------------------
> :t-0000104                     71190      85680      20.353982 %
> UDP                            4352       3392       22.058824 %
> inode_cache                    54302      41923      22.796582 %
> fscache_cookie_jar             3276       2457       25.000000 %
> :t-0000896                     438        292        33.333333 %
> :t-0000080                     310401     195323     37.073978 %
> ext4_inode_cache               335        201        40.000000 %
> :t-0000192                     89408      128898     44.168307 %
> :t-0000184                     151300     81880      45.882353 %
> :t-0000512                     49698      73648      48.191074 %
> :at-0000192                    242867     120948     50.199904 %
> xfs_inode                      34350      15221      55.688501 %
> :t-0016384                     11005      17257      56.810541 %
> proc_inode_cache               103868     34717      66.575846 %
> tw_sock_TCP                    768        256        66.666667 %
> :t-0004096                     15240      25672      68.451444 %
> nfs_inode_cache                1008       315        68.750000 %
> :t-0001024                     14528      24720      70.154185 %
> :t-0032768                     655        1312       100.305344%
> :t-0002048                     14242      30720      115.700042%
> :t-0000640                     1020       2550       150.000000%
> :t-0008192                     10005      27905      178.910545%
> 
> FWIW, the configuration of this LPAR has changed slightly. It is now configured
> for a maximum of 400 CPUs, of which 200 are present. The result is that even with
> Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
> script reports:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-512                        1182 MB    2.03%  100.00%
> kmalloc-192                        1182 MB    1.38%  100.00%
> kmalloc-16384                       966 MB   17.66%  100.00%
> kmalloc-4096                        353 MB   15.92%  100.00%
> kmalloc-8192                        259 MB   27.28%  100.00%
> kmalloc-32768                       207 MB    9.86%  100.00%
> 
> In comparison (log2 above):
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       273 MB   98.76%  100.00%
> kmalloc-8192                        225 MB   98.67%  100.00%
> pgtable-2^11                        114 MB  100.00%  100.00%
> pgtable-2^12                        109 MB  100.00%  100.00%
> kmalloc-4096                        104 MB   98.59%  100.00%
> 
> I appreciate all the help so far, if anyone has any ideas how best to
> proceed further, or what they'd like debugged more, I'm happy to get
> this fixed. We're hitting this on a couple of different systems and I'd
> like to find a good resolution to the problem.

Hello,

I don't have a memoryless system, so, to debug this, I need your help. :)
First, please let me know the node information on your system.

I'm preparing 3 more patches which are nearly the same as the previous patch,
but take a slightly different approach. Could you test them on your system?
I will send them soon.

And I think the same problem exists if CONFIG_SLAB is enabled. Could you
confirm that?

And, could you confirm that your system's numa_mem_id() is properly set?
And, could you confirm that the node_present_pages() test works properly?
And, with my patches, could you give me more information on the slub stats?
For this, you need to enable CONFIG_SLUB_STATS. Then please send me all the
slub stats on /proc/sys/kernel/debug/slab.
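
(For reference, with CONFIG_SLUB_STATS enabled the per-cache counters are
exposed as individual files under /sys/kernel/slab/<cache>/, e.g.
alloc_node_mismatch.)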

Sorry for so many requests.
If it bothers you too much, please ignore it :)

Thanks.


^ permalink raw reply	[flat|nested] 229+ messages in thread

* [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
  2014-02-06  2:07                         ` Nishanth Aravamudan
@ 2014-02-06  8:07                           ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-06  8:07 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Joonsoo Kim, Wanpeng Li

Currently, if the allocation constraint node is NUMA_NO_NODE, we search
for a partial slab on the numa_node_id() node. This doesn't work properly on
a system with memoryless nodes, since such a node may have no memory and
therefore no partial slabs at all.

On such a node, page allocation always falls back to numa_mem_id() first. So
searching for a partial slab on numa_mem_id() in that case is the proper
solution for the memoryless node case.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/slub.c b/mm/slub.c
index 545a170..cc1f995 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1698,7 +1698,7 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 		struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
-- 
1.7.9.5
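
To illustrate the distinction the patch relies on (hypothetical two-node
layout: node 0 memoryless, all memory on node 1), a cpu sitting on node 0
would see:

	int nid = numa_node_id();	/* 0: node the cpu belongs to  */
	int mem = numa_mem_id();	/* 1: nearest node with memory */

so searching node 1's partial list matches where the page allocator would
actually have placed the slab pages for a NUMA_NO_NODE request.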


^ permalink raw reply related	[flat|nested] 229+ messages in thread

* [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-06  8:07                           ` Joonsoo Kim
@ 2014-02-06  8:07                             ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-06  8:07 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Joonsoo Kim, Wanpeng Li

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..a6d5438 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -233,11 +233,20 @@ static inline int numa_node_id(void)
  * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
  */
 DECLARE_PER_CPU(int, _numa_mem_);
+int _node_numa_mem_[MAX_NUMNODES];
 
 #ifndef set_numa_mem
 static inline void set_numa_mem(int node)
 {
 	this_cpu_write(_numa_mem_, node);
+	_node_numa_mem_[numa_node_id()] = node;
+}
+#endif
+
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return _node_numa_mem_[node];
 }
 #endif
 
@@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
 static inline void set_cpu_numa_mem(int cpu, int node)
 {
 	per_cpu(_numa_mem_, cpu) = node;
+	_node_numa_mem_[numa_node_id()] = node;
 }
 #endif
 
@@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
 }
 #endif
 
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return node;
+}
+#endif
+
 #ifndef cpu_to_mem
 static inline int cpu_to_mem(int cpu)
 {
-- 
1.7.9.5
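
The _node_numa_mem_ array appears intended as a per-node cache of the
nearest node with memory, so that get_numa_mem(X) answers "where do node-X
allocations actually land". One way to picture the intended contents
(sketch only, using local_memory_node() from mm/page_alloc.c):

	int node;

	for_each_online_node(node)
		_node_numa_mem_[node] = local_memory_node(node);

As posted, though, the slot is only written for numa_node_id() at
set_numa_mem()/set_cpu_numa_mem() time; see David's reading of this below.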


^ permalink raw reply related	[flat|nested] 229+ messages in thread

* [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
  2014-02-06  8:07                           ` Joonsoo Kim
@ 2014-02-06  8:07                             ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-06  8:07 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Joonsoo Kim, Wanpeng Li

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/mm/slub.c b/mm/slub.c
index cc1f995..c851f82 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1700,6 +1700,14 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 	void *object;
 	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
 
+	if (node == NUMA_NO_NODE)
+		searchnode = numa_mem_id();
+	else {
+		searchnode = node;
+		if (!node_present_pages(node))
+			searchnode = get_numa_mem(node);
+	}
+
 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
 		return object;
@@ -2277,11 +2285,18 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 redo:
 
 	if (unlikely(!node_match(page, node))) {
-		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+		int searchnode = node;
+
+		if (node != NUMA_NO_NODE && !node_present_pages(node))
+			searchnode = get_numa_mem(node);
+
+		if (!node_match(page, searchnode)) {
+			stat(s, ALLOC_NODE_MISMATCH);
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}
 
 	/*
-- 
1.7.9.5
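
As a concrete walk-through of the intended flow (same hypothetical layout:
node 0 memoryless, node 1 with memory, so get_numa_mem(0) == 1): a
kmalloc_node(size, GFP_KERNEL, 0) reaches __slab_alloc() with node == 0;
node_present_pages(0) is 0, so searchnode becomes 1, and if the current
cpu slab already came from node 1, node_match() succeeds and the slab is
reused rather than deactivated. get_partial() applies the same translation
when searching the partial lists.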


^ permalink raw reply related	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
  2014-02-06  8:07                           ` Joonsoo Kim
@ 2014-02-06  8:37                             ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-02-06  8:37 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Nishanth Aravamudan, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li

On Thu, 6 Feb 2014, Joonsoo Kim wrote:

> Currently, if the allocation constraint node is NUMA_NO_NODE, we search
> for a partial slab on the numa_node_id() node. This doesn't work properly on
> a system with memoryless nodes, since such a node may have no memory and
> therefore no partial slabs at all.
> 
> On such a node, page allocation always falls back to numa_mem_id() first. So
> searching for a partial slab on numa_mem_id() in that case is the proper
> solution for the memoryless node case.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 

Acked-by: David Rientjes <rientjes@google.com>

I think you'll need to send these to Andrew since he appears to be picking 
up slub patches these days.


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-06  8:07                             ` Joonsoo Kim
@ 2014-02-06  8:52                               ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-02-06  8:52 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Nishanth Aravamudan, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li

On Thu, 6 Feb 2014, Joonsoo Kim wrote:

> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 

I may be misunderstanding this patch and there's no help because there's 
no changelog.

> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 12ae6ce..a6d5438 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
>   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
>   */
>  DECLARE_PER_CPU(int, _numa_mem_);
> +int _node_numa_mem_[MAX_NUMNODES];
>  
>  #ifndef set_numa_mem
>  static inline void set_numa_mem(int node)
>  {
>  	this_cpu_write(_numa_mem_, node);
> +	_node_numa_mem_[numa_node_id()] = node;
> +}
> +#endif
> +
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> +	return _node_numa_mem_[node];
>  }
>  #endif
>  
> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
>  static inline void set_cpu_numa_mem(int cpu, int node)
>  {
>  	per_cpu(_numa_mem_, cpu) = node;
> +	_node_numa_mem_[numa_node_id()] = node;

The intention seems to be that _node_numa_mem_[X] for a node X will return 
a node Y with memory that has the nearest distance?  In other words, 
caching the value returned by local_memory_node(X)?

That doesn't seem to be what it's doing since numa_node_id() is the node 
of the cpu that current is running on so this ends up getting initialized 
to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in 
cpu_possible_mask.
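
A minimal sketch of the initialization this seems to call for, assuming
the array is meant to cache local_memory_node() for every node; the loop
below is hypothetical and not part of the posted patch:

	int cpu;

	/*
	 * Record the fallback for each possible cpu's node, instead of
	 * repeatedly overwriting the slot of whatever node the current
	 * cpu happens to be running on.
	 */
	for_each_possible_cpu(cpu)
		_node_numa_mem_[cpu_to_node(cpu)] =
				local_memory_node(cpu_to_node(cpu));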

>  }
>  #endif
>  
> @@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
>  }
>  #endif
>  
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> +	return node;
> +}
> +#endif
> +
>  #ifndef cpu_to_mem
>  static inline int cpu_to_mem(int cpu)
>  {


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-06 10:29                                 ` Joonsoo Kim
  0 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-06 10:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: Han Pingtian, Nishanth Aravamudan, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Joonsoo Kim, linuxppc-dev, Christoph Lameter,
	Wanpeng Li

2014-02-06 David Rientjes <rientjes@google.com>:
> On Thu, 6 Feb 2014, Joonsoo Kim wrote:
>
>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>
>
> I may be misunderstanding this patch and there's no help because there's
> no changelog.

Sorry about that.
I made this patch just for testing. :)
Thanks for looking at this.

>> diff --git a/include/linux/topology.h b/include/linux/topology.h
>> index 12ae6ce..a6d5438 100644
>> --- a/include/linux/topology.h
>> +++ b/include/linux/topology.h
>> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
>>   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
>>   */
>>  DECLARE_PER_CPU(int, _numa_mem_);
>> +int _node_numa_mem_[MAX_NUMNODES];
>>
>>  #ifndef set_numa_mem
>>  static inline void set_numa_mem(int node)
>>  {
>>       this_cpu_write(_numa_mem_, node);
>> +     _node_numa_mem_[numa_node_id()] = node;
>> +}
>> +#endif
>> +
>> +#ifndef get_numa_mem
>> +static inline int get_numa_mem(int node)
>> +{
>> +     return _node_numa_mem_[node];
>>  }
>>  #endif
>>
>> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
>>  static inline void set_cpu_numa_mem(int cpu, int node)
>>  {
>>       per_cpu(_numa_mem_, cpu) = node;
>> +     _node_numa_mem_[numa_node_id()] = node;
>
> The intention seems to be that _node_numa_mem_[X] for a node X will return
> a node Y with memory that has the nearest distance?  In other words,
> caching the value returned by local_memory_node(X)?

Yes, you are right.

> That doesn't seem to be what it's doing since numa_node_id() is the node
> of the cpu that current is running on so this ends up getting initialized
> to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in
> cpu_possible_mask.

Yes, I made a mistake.
Thanks for the pointer.
I've fixed it and attached v2 below.
Now I'm out of the office, so I'm not sure this second version is correct :(

Thanks.

----------8<--------------
From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Date: Thu, 6 Feb 2014 17:07:05 +0900
Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
 determining the fallback node

We need to determine the fallback node in the slub allocator if the
allocation target node is a memoryless node. Without it, SLUB wrongly
selects a node which has no memory and can't use a partial slab, because
of the node mismatch. The introduced function, node_numa_mem(X), returns
a node Y with memory that has the nearest distance to X. If X is a
memoryless node, it returns the nearest node with memory; if X is a
normal node, it returns X itself.

We will use this function in the following patch to determine the
fallback node.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 12ae6ce..66b19b8 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -233,11 +233,20 @@ static inline int numa_node_id(void)
  * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
  */
 DECLARE_PER_CPU(int, _numa_mem_);
+int _node_numa_mem_[MAX_NUMNODES];

 #ifndef set_numa_mem
 static inline void set_numa_mem(int node)
 {
 	this_cpu_write(_numa_mem_, node);
+	_node_numa_mem_[numa_node_id()] = node;
+}
+#endif
+
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return _node_numa_mem_[node];
 }
 #endif
 
@@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
 static inline void set_cpu_numa_mem(int cpu, int node)
 {
 	per_cpu(_numa_mem_, cpu) = node;
+	_node_numa_mem_[cpu_to_node(cpu)] = node;
 }
 #endif
 
@@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
 }
 #endif
 
+#ifndef get_numa_mem
+static inline int get_numa_mem(int node)
+{
+	return node;
+}
+#endif
+
 #ifndef cpu_to_mem
 static inline int cpu_to_mem(int cpu)
 {
-- 
1.7.9.5
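
A hedged sketch of how a slab-side caller might consume this accessor,
mirroring the __slab_alloc() hunk quoted earlier in the thread;
slab_search_node() is a hypothetical helper name:

	/* Resolve a possibly-memoryless target node to a node worth searching. */
	static int slab_search_node(int node)
	{
		if (node == NUMA_NO_NODE)
			return numa_mem_id();	/* nearest memory to this cpu */
		if (!node_present_pages(node))
			return get_numa_mem(node);	/* cached fallback for node */
		return node;
	}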

^ permalink raw reply related	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-02-06  2:08                                           ` Nishanth Aravamudan
@ 2014-02-06 17:25                                             ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-06 17:25 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li

On Wed, 5 Feb 2014, Nishanth Aravamudan wrote:

> > Right so if we are ignoring the node then the simplest thing to do is to
> > not deactivate the current cpu slab but to take an object from it.
>
> Ok, that's what Anton's patch does, I believe. Are you ok with that
> patch as it is?

No. Again, his patch only works if the node is memoryless, not if there
are other issues that prevent allocation from that node.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
  2014-02-06  8:07                           ` Joonsoo Kim
@ 2014-02-06 17:26                             ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-06 17:26 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Nishanth Aravamudan, David Rientjes, Han Pingtian, penberg,
	linux-mm, paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li

On Thu, 6 Feb 2014, Joonsoo Kim wrote:

> Currently, if the allocation constraint is NUMA_NO_NODE, we search for
> a partial slab on the numa_node_id() node. This doesn't work properly on a
> system with a memoryless node, since that node has no memory and
> therefore can have no partial slab.
>
> On such a node, page allocation always falls back to numa_mem_id() first. So
> searching for a partial slab on numa_mem_id() is the proper solution for
> the memoryless node case.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
  2014-02-06  8:07                             ` Joonsoo Kim
@ 2014-02-06 17:30                               ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-06 17:30 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Nishanth Aravamudan, David Rientjes, Han Pingtian, penberg,
	linux-mm, paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li

On Thu, 6 Feb 2014, Joonsoo Kim wrote:

> diff --git a/mm/slub.c b/mm/slub.c
> index cc1f995..c851f82 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1700,6 +1700,14 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>  	void *object;
>  	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
>
> +	if (node == NUMA_NO_NODE)
> +		searchnode = numa_mem_id();
> +	else {
> +		searchnode = node;
> +		if (!node_present_pages(node))

This check would need to be something that checks for other contingencies
in the page allocator as well. A simple solution would be to actually run
a GFP_THISNODE alloc to see if you can grab a page from the proper node.
If that fails then fall back. See how fallback_alloc() does it in slab.

> +			searchnode = get_numa_mem(node);
> +	}

> @@ -2277,11 +2285,18 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
>  redo:
>
>  	if (unlikely(!node_match(page, node))) {
> -		stat(s, ALLOC_NODE_MISMATCH);
> -		deactivate_slab(s, page, c->freelist);
> -		c->page = NULL;
> -		c->freelist = NULL;
> -		goto new_slab;
> +		int searchnode = node;
> +
> +		if (node != NUMA_NO_NODE && !node_present_pages(node))

Same issue here. I would suggest not deactivating the slab and first
checking if the node has no pages. If so, then just take an object from the
current cpu slab. If that is not available, do an allocation from the
indicated node and take whatever the page allocator gives you.
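
A minimal sketch of the probe suggested above, assuming a single order-0
__GFP_THISNODE allocation is an acceptable test; node_can_alloc() is a
hypothetical helper, not an existing kernel function:

	/* Can @node currently satisfy an allocation at all? */
	static bool node_can_alloc(int node, gfp_t flags)
	{
		struct page *page;

		/* __GFP_THISNODE forbids falling back to other nodes */
		page = alloc_pages_node(node,
				(flags | __GFP_THISNODE) & ~__GFP_WAIT, 0);
		if (!page)
			return false;

		__free_page(page);
		return true;
	}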

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
  2014-02-06  8:37                             ` David Rientjes
@ 2014-02-06 17:31                               ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-06 17:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: Joonsoo Kim, Nishanth Aravamudan, Han Pingtian, penberg,
	linux-mm, paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li

On Thu, 6 Feb 2014, David Rientjes wrote:

> I think you'll need to send these to Andrew since he appears to be picking
> up slub patches these days.

I can start managing merges again if Pekka no longer has the time.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-06 10:29                                 ` Joonsoo Kim
@ 2014-02-06 19:11                                   ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-06 19:11 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, Joonsoo Kim, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li

On 06.02.2014 [19:29:16 +0900], Joonsoo Kim wrote:
> 2014-02-06 David Rientjes <rientjes@google.com>:
> > On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> >
> >> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >>
> >
> > I may be misunderstanding this patch and there's no help because there's
> > no changelog.
> 
> Sorry about that.
> I made this patch just for testing. :)
> Thanks for looking at this.
> 
> >> diff --git a/include/linux/topology.h b/include/linux/topology.h
> >> index 12ae6ce..a6d5438 100644
> >> --- a/include/linux/topology.h
> >> +++ b/include/linux/topology.h
> >> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> >>   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
> >>   */
> >>  DECLARE_PER_CPU(int, _numa_mem_);
> >> +int _node_numa_mem_[MAX_NUMNODES];
> >>
> >>  #ifndef set_numa_mem
> >>  static inline void set_numa_mem(int node)
> >>  {
> >>       this_cpu_write(_numa_mem_, node);
> >> +     _node_numa_mem_[numa_node_id()] = node;
> >> +}
> >> +#endif
> >> +
> >> +#ifndef get_numa_mem
> >> +static inline int get_numa_mem(int node)
> >> +{
> >> +     return _node_numa_mem_[node];
> >>  }
> >>  #endif
> >>
> >> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
> >>  static inline void set_cpu_numa_mem(int cpu, int node)
> >>  {
> >>       per_cpu(_numa_mem_, cpu) = node;
> >> +     _node_numa_mem_[numa_node_id()] = node;
> >
> > The intention seems to be that _node_numa_mem_[X] for a node X will return
> > a node Y with memory that has the nearest distance?  In other words,
> > caching the value returned by local_memory_node(X)?
> 
> Yes, you are right.
> 
> > That doesn't seem to be what it's doing since numa_node_id() is the node
> > of the cpu that current is running on so this ends up getting initialized
> > to whatever local_memory_node(cpu_to_node(cpu)) is for the last bit set in
> > cpu_possible_mask.
> 
> Yes, I made a mistake.
> Thanks for the pointer.
> I've fixed it and attached v2 below.
> Now I'm out of the office, so I'm not sure this second version is correct :(
> 
> Thanks.
> 
> ----------8<--------------
> From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Thu, 6 Feb 2014 17:07:05 +0900
> Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
>  determining the fallback node
> 
> We need to determine the fallback node in the slub allocator if the
> allocation target node is a memoryless node. Without it, SLUB wrongly
> selects a node which has no memory and can't use a partial slab, because
> of the node mismatch. The introduced function, node_numa_mem(X), returns
> a node Y with memory that has the nearest distance to X. If X is a
> memoryless node, it returns the nearest node with memory; if X is a
> normal node, it returns X itself.
> 
> We will use this function in the following patch to determine the
> fallback node.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index 12ae6ce..66b19b8 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
>   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
>   */
>  DECLARE_PER_CPU(int, _numa_mem_);
> +int _node_numa_mem_[MAX_NUMNODES];

Should be static, I think?

> 
>  #ifndef set_numa_mem
>  static inline void set_numa_mem(int node)
>  {
>  	this_cpu_write(_numa_mem_, node);
> +	_node_numa_mem_[numa_node_id()] = node;
> +}
> +#endif
> +
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> +	return _node_numa_mem_[node];
>  }
>  #endif
> 
> @@ -260,6 +269,7 @@ static inline int cpu_to_mem(int cpu)
>  static inline void set_cpu_numa_mem(int cpu, int node)
>  {
>  	per_cpu(_numa_mem_, cpu) = node;
> +	_node_numa_mem_[cpu_to_node(cpu)] = node;
>  }
>  #endif
> 
> @@ -273,6 +283,13 @@ static inline int numa_mem_id(void)
>  }
>  #endif
> 
> +#ifndef get_numa_mem
> +static inline int get_numa_mem(int node)
> +{
> +	return node;
> +}
> +#endif
> +
>  #ifndef cpu_to_mem
>  static inline int cpu_to_mem(int cpu)
>  {
> -- 
> 1.7.9.5
> 

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
       [not found]                           ` <20140206185955.GA7845@linux.vnet.ibm.com>
@ 2014-02-06 19:28                               ` Nishanth Aravamudan
  0 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-06 19:28 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li

[-- Attachment #1: Type: text/plain, Size: 8967 bytes --]

On 06.02.2014 [10:59:55 -0800], Nishanth Aravamudan wrote:
> On 06.02.2014 [17:04:18 +0900], Joonsoo Kim wrote:
> > On Wed, Feb 05, 2014 at 06:07:57PM -0800, Nishanth Aravamudan wrote:
> > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > 
> > > > > Thank you for clarifying and providing a test patch. I ran with this on
> > > > > the system showing the original problem, configured to have 15GB of
> > > > > memory.
> > > > > 
> > > > > With your patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:         8768192 kB
> > > > > Slab:            3882560 kB
> > > > > SReclaimable:     105408 kB
> > > > > SUnreclaim:      3777152 kB
> > > > > 
> > > > > With Anton's patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:        11195008 kB
> > > > > Slab:            1427968 kB
> > > > > SReclaimable:     109184 kB
> > > > > SUnreclaim:      1318784 kB
> > > > > 
> > > > > 
> > > > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > > > 
> > > > 
> > > > I don't think the goal of the discussion is to reduce the amount of slab 
> > > > allocated, but rather get the most local slab memory possible by use of 
> > > > kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> > > > which is probably cpu_to_node() for a cpu bound to a node without memory, 
> > > > my patch is allocating it on the most local node; Anton's patch is 
> > > > allocating it on whatever happened to be the cpu slab.
> > > > 
> > > > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > > > --- a/mm/slub.c
> > > > > > +++ b/mm/slub.c
> > > > > > @@ -2278,10 +2278,14 @@ redo:
> > > > > > 
> > > > > >  	if (unlikely(!node_match(page, node))) {
> > > > > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > > > > -		deactivate_slab(s, page, c->freelist);
> > > > > > -		c->page = NULL;
> > > > > > -		c->freelist = NULL;
> > > > > > -		goto new_slab;
> > > > > > +		if (unlikely(!node_present_pages(node)))
> > > > > > +			node = numa_mem_id();
> > > > > > +		if (!node_match(page, node)) {
> > > > > > +			deactivate_slab(s, page, c->freelist);
> > > > > > +			c->page = NULL;
> > > > > > +			c->freelist = NULL;
> > > > > > +			goto new_slab;
> > > > > > +		}
> > > > > 
> > > > > Semantically, and please correct me if I'm wrong, this patch is saying
> > > > > if we have a memoryless node, we expect the page's locality to be that
> > > > > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > > > > Just wanting to make sure I understand the intent.
> > > > > 
> > > > 
> > > > Yeah, the default policy should be to fallback to local memory if the node 
> > > > passed is memoryless.
> > > > 
> > > > > What I find odd is that there are only 2 nodes on this system, node 0
> > > > > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > > > > should be coming from node 1 (thus node_match() should always be true?)
> > > > > 
> > > > 
> > > > The nice thing about slub is its debugging ability: what is 
> > > > /sys/kernel/slab/cache/objects showing in comparison between the two 
> > > > patches?
> > > 
> > > Ok, I finally got around to writing a script that compares the objects
> > > output from both kernels.
> > > 
> > > log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> > > and Joonsoo's patch.
> > > 
> > > log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> > > and Anton's patch.
> > > 
> > > slab                           objects    objects   percent
> > >                                log1       log2      change
> > > -----------------------------------------------------------
> > > :t-0000104                     71190      85680      20.353982 %
> > > UDP                            4352       3392       22.058824 %
> > > inode_cache                    54302      41923      22.796582 %
> > > fscache_cookie_jar             3276       2457       25.000000 %
> > > :t-0000896                     438        292        33.333333 %
> > > :t-0000080                     310401     195323     37.073978 %
> > > ext4_inode_cache               335        201        40.000000 %
> > > :t-0000192                     89408      128898     44.168307 %
> > > :t-0000184                     151300     81880      45.882353 %
> > > :t-0000512                     49698      73648      48.191074 %
> > > :at-0000192                    242867     120948     50.199904 %
> > > xfs_inode                      34350      15221      55.688501 %
> > > :t-0016384                     11005      17257      56.810541 %
> > > proc_inode_cache               103868     34717      66.575846 %
> > > tw_sock_TCP                    768        256        66.666667 %
> > > :t-0004096                     15240      25672      68.451444 %
> > > nfs_inode_cache                1008       315        68.750000 %
> > > :t-0001024                     14528      24720      70.154185 %
> > > :t-0032768                     655        1312       100.305344%
> > > :t-0002048                     14242      30720      115.700042%
> > > :t-0000640                     1020       2550       150.000000%
> > > :t-0008192                     10005      27905      178.910545%
> > > 
> > > FWIW, the configuration of this LPAR has slightly changed. It is now configured
> > > for a maximum of 400 CPUs, of which 200 are present. The result is that even with
> > > Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
> > > script reports:
> > > 
> > > slab                                   mem     objs    slabs
> > >                                       used   active   active
> > > ------------------------------------------------------------
> > > kmalloc-512                        1182 MB    2.03%  100.00%
> > > kmalloc-192                        1182 MB    1.38%  100.00%
> > > kmalloc-16384                       966 MB   17.66%  100.00%
> > > kmalloc-4096                        353 MB   15.92%  100.00%
> > > kmalloc-8192                        259 MB   27.28%  100.00%
> > > kmalloc-32768                       207 MB    9.86%  100.00%
> > > 
> > > In comparison (log2 above):
> > > 
> > > slab                                   mem     objs    slabs
> > >                                       used   active   active
> > > ------------------------------------------------------------
> > > kmalloc-16384                       273 MB   98.76%  100.00%
> > > kmalloc-8192                        225 MB   98.67%  100.00%
> > > pgtable-2^11                        114 MB  100.00%  100.00%
> > > pgtable-2^12                        109 MB  100.00%  100.00%
> > > kmalloc-4096                        104 MB   98.59%  100.00%
> > > 
> > > I appreciate all the help so far. If anyone has any ideas on how best to
> > > proceed further, or what they'd like debugged more, I'm happy to get
> > > this fixed. We're hitting this on a couple of different systems and I'd
> > > like to find a good resolution to the problem.
> > 
> > Hello,
> > 
> > I have no memoryless system, so to debug this I need your help. :)
> > First, please let me know the node information on your system.
> 
> [    0.000000] Node 0 Memory:
> [    0.000000] Node 1 Memory: 0x0-0x200000000
> 
> [    0.000000] On node 0 totalpages: 0
> [    0.000000] On node 1 totalpages: 131072
> [    0.000000]   DMA zone: 112 pages used for memmap
> [    0.000000]   DMA zone: 0 pages reserved
> [    0.000000]   DMA zone: 131072 pages, LIFO batch:1
> 
> [    0.638391] Node 0 CPUs: 0-199
> [    0.638394] Node 1 CPUs:
> 
> Do you need anything else?
> 
> > I'm preparing 3 more patches which are nearly the same as the previous patch,
> > but with a slightly different approach. Could you test them on your system?
> > I will send them soon.
> 
> Test results are in the attached tarball [1].
> 
> > And I think that same problem exists if CONFIG_SLAB is enabled. Could you
> > confirm that?
> 
> I will test and let you know.

Ok, with your patches applied and CONFIG_SLAB enabled:

MemTotal:        8264640 kB
MemFree:         7119680 kB
Slab:             207232 kB
SReclaimable:      32896 kB
SUnreclaim:       174336 kB

For reference, same kernel with CONFIG_SLUB:

MemTotal:        8264640 kB
MemFree:         4264000 kB
Slab:            3065408 kB
SReclaimable:     104704 kB
SUnreclaim:      2960704 kB

So CONFIG_SLAB is much better in this case.

Without your patches (but still CONFIG_HAVE_MEMORYLESS_NODES, kthread
locality patch and two other unrelated bugfix patches):

3.13.0-slub:

MemTotal:        8264704 kB
MemFree:         4404288 kB
Slab:            2963648 kB
SReclaimable:     106816 kB
SUnreclaim:      2856832 kB

3.13.0-slab:

MemTotal:        8264640 kB
MemFree:         7263168 kB
Slab:             206144 kB
SReclaimable:      32576 kB
SUnreclaim:       173568 kB

In case it's helpful, I've attached /proc/slabinfo from both kernels.

Thanks,
Nish

[-- Attachment #2: slabusage.3.13.SLAB --]
[-- Type: text/plain, Size: 13115 bytes --]

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
thread_info                          34 MB   96.33%  100.00%
kmalloc-1024                         22 MB   97.44%  100.00%
task_struct                          19 MB   95.15%  100.00%
kmalloc-16384                         9 MB   98.05%  100.00%
inode_cache                           8 MB   97.74%  100.00%
kmalloc-512                           7 MB   89.56%  100.00%
dentry                                7 MB   98.89%  100.00%
kmalloc-8192                          6 MB   98.64%  100.00%
proc_inode_cache                      6 MB   90.20%  100.00%
idr_layer_cache                       4 MB   94.76%  100.00%
sighand_cache                         4 MB   94.69%  100.00%
pgtable-2^12                          3 MB   72.58%  100.00%
xfs_inode                             3 MB   98.89%  100.00%
sysfs_dir_cache                       3 MB   98.29%  100.00%
radix_tree_node                       2 MB   97.19%  100.00%
kmalloc-32768                         2 MB   97.96%  100.00%
kmalloc-4096                          2 MB   97.68%  100.00%
filp                                  2 MB   20.71%  100.00%
signal_cache                          2 MB   72.35%  100.00%
pgtable-2^10                          2 MB   52.81%  100.00%
kmalloc-256                           2 MB   85.56%  100.00%
kmalloc-2048                          1 MB   84.95%  100.00%
shmem_inode_cache                     1 MB   89.59%  100.00%
dtl                                   1 MB   98.77%  100.00%
kmalloc-192                           1 MB   77.89%  100.00%
vm_area_struct                        1 MB   76.80%  100.00%
cred_jar                              1 MB   36.80%  100.00%
kmem_cache                            1 MB   97.69%  100.00%
kmalloc-65536                         0 MB  100.00%  100.00%
kmalloc-128                           0 MB   87.07%  100.00%
buffer_head                           0 MB   92.52%  100.00%
kmalloc-32                            0 MB   92.89%  100.00%
anon_vma_chain                        0 MB   47.46%  100.00%
sock_inode_cache                      0 MB   65.45%  100.00%
kmalloc-64                            0 MB   94.98%  100.00%
files_cache                           0 MB   60.85%  100.00%
names_cache                           0 MB   85.83%  100.00%
mm_struct                             0 MB   22.06%  100.00%
xfs_buf                               0 MB   91.50%  100.00%
UNIX                                  0 MB   37.90%  100.00%
task_delay_info                       0 MB   66.76%  100.00%
skbuff_head_cache                     0 MB   50.33%  100.00%
pid                                   0 MB   62.63%  100.00%
RAW                                   0 MB   92.59%  100.00%
kmalloc-96                            0 MB   63.71%  100.00%
anon_vma                              0 MB   52.25%  100.00%
xfs_ifork                             0 MB   88.60%  100.00%
biovec-256                            0 MB   75.56%  100.00%
TCP                                   0 MB   19.66%  100.00%
ftrace_event_field                    0 MB   63.17%  100.00%
fs_cache                              0 MB   24.30%  100.00%
file_lock_cache                       0 MB    5.24%  100.00%
eventpoll_epi                         0 MB   13.21%  100.00%
cifs_request                          0 MB   71.43%  100.00%
cfq_queue                             0 MB   26.90%  100.00%
blkdev_queue                          0 MB   48.39%  100.00%
UDP                                   0 MB   12.50%  100.00%
xfs_trans                             0 MB    4.33%  100.00%
xfs_log_ticket                        0 MB    3.45%  100.00%
xfs_log_item_desc                     0 MB    2.42%  100.00%
xfs_ioend                             0 MB   84.65%  100.00%
xfs_ili                               0 MB   66.20%  100.00%
xfs_buf_item                          0 MB    7.94%  100.00%
xfs_btree_cur                         0 MB    1.94%  100.00%
uid_cache                             0 MB    1.61%  100.00%
tcp_bind_bucket                       0 MB    2.18%  100.00%
taskstats                             0 MB    3.55%  100.00%
sigqueue                              0 MB    0.75%  100.00%
sgpool-8                              0 MB    1.59%  100.00%
sgpool-64                             0 MB    6.45%  100.00%
sgpool-32                             0 MB    3.17%  100.00%
sgpool-16                             0 MB    1.57%  100.00%
sgpool-128                            0 MB   13.33%  100.00%
sd_ext_cdb                            0 MB    0.11%  100.00%
scsi_sense_cache                      0 MB    0.60%  100.00%
scsi_cmd_cache                        0 MB    1.19%  100.00%
rpc_tasks                             0 MB    3.17%  100.00%
rpc_inode_cache                       0 MB   31.68%  100.00%
rpc_buffers                           0 MB   25.81%  100.00%
revoke_table                          0 MB    0.12%  100.00%
pool_workqueue                        0 MB    4.37%  100.00%
numa_policy                           0 MB   46.75%  100.00%
nsproxy                               0 MB    0.16%  100.00%
nfs_write_data                        0 MB   50.79%  100.00%
nfs_inode_cache                       0 MB   27.69%  100.00%
nfs_commit_data                       0 MB    4.76%  100.00%
nf_conntrack_c000000000cc9900         0 MB   45.22%  100.00%
mqueue_inode_cache                    0 MB    1.39%  100.00%
mnt_cache                             0 MB   53.57%  100.00%
key_jar                               0 MB    5.56%  100.00%
jbd2_revoke_table_s                   0 MB    0.06%  100.00%
ip_fib_trie                           0 MB    0.73%  100.00%
ip_fib_alias                          0 MB    0.71%  100.00%
ip_dst_cache                          0 MB   30.16%  100.00%
inotify_inode_mark                    0 MB   17.23%  100.00%
inet_peer_cache                       0 MB    3.97%  100.00%
hugetlbfs_inode_cache                 0 MB    2.59%  100.00%
ftrace_event_file                     0 MB   92.58%  100.00%
fsnotify_event                        0 MB    0.18%  100.00%
ext4_inode_cache                      0 MB    4.35%  100.00%
ext4_groupinfo_4k                     0 MB    8.55%  100.00%
ext4_extent_status                    0 MB    0.07%  100.00%
ext3_inode_cache                      0 MB    4.76%  100.00%
eventpoll_pwq                         0 MB   15.20%  100.00%
dnotify_struct                        0 MB    0.60%  100.00%
dnotify_mark                          0 MB    2.08%  100.00%
dm_io                                 0 MB    2.28%  100.00%
cifs_small_rq                         0 MB   23.62%  100.00%
cifs_mpx_ids                          0 MB    0.60%  100.00%
cfq_io_cq                             0 MB   26.42%  100.00%
blkdev_requests                       0 MB   10.56%  100.00%
blkdev_ioc                            0 MB   19.31%  100.00%
biovec-16                             0 MB    1.19%  100.00%
bio-1                                 0 MB   13.49%  100.00%
bio-0                                 0 MB    1.59%  100.00%
bdev_cache                            0 MB   52.78%  100.00%
xfs_mru_cache_elem                    0 MB    0.00%    0.00%
xfs_icr                               0 MB    0.00%    0.00%
xfs_efi_item                          0 MB    0.00%    0.00%
xfs_efd_item                          0 MB    0.00%    0.00%
xfs_da_state                          0 MB    0.00%    0.00%
xfs_bmap_free_item                    0 MB    0.00%    0.00%
xfrm_dst_cache                        0 MB    0.00%    0.00%
tw_sock_TCP                           0 MB    0.00%    0.00%
skbuff_fclone_cache                   0 MB    0.00%    0.00%
shared_policy_node                    0 MB    0.00%    0.00%
secpath_cache                         0 MB    0.00%    0.00%
scsi_data_buffer                      0 MB    0.00%    0.00%
revoke_record                         0 MB    0.00%    0.00%
request_sock_TCP                      0 MB    0.00%    0.00%
reiser_inode_cache                    0 MB    0.00%    0.00%
posix_timers_cache                    0 MB    0.00%    0.00%
pid_namespace                         0 MB    0.00%    0.00%
nfsd_drc                              0 MB    0.00%    0.00%
nfsd4_stateids                        0 MB    0.00%    0.00%
nfsd4_openowners                      0 MB    0.00%    0.00%
nfsd4_lockowners                      0 MB    0.00%    0.00%
nfsd4_files                           0 MB    0.00%    0.00%
nfsd4_delegations                     0 MB    0.00%    0.00%
nfs_read_data                         0 MB    0.00%    0.00%
nfs_page                              0 MB    0.00%    0.00%
nfs_direct_cache                      0 MB    0.00%    0.00%
nf_conntrack_expect                   0 MB    0.00%    0.00%
net_namespace                         0 MB    0.00%    0.00%
kmalloc-8388608                       0 MB    0.00%    0.00%
kmalloc-524288                        0 MB    0.00%    0.00%
kmalloc-4194304                       0 MB    0.00%    0.00%
kmalloc-262144                        0 MB    0.00%    0.00%
kmalloc-2097152                       0 MB    0.00%    0.00%
kmalloc-16777216                      0 MB    0.00%    0.00%
kmalloc-131072                        0 MB    0.00%    0.00%
kmalloc-1048576                       0 MB    0.00%    0.00%
kioctx                                0 MB    0.00%    0.00%
kiocb                                 0 MB    0.00%    0.00%
kcopyd_job                            0 MB    0.00%    0.00%
journal_head                          0 MB    0.00%    0.00%
journal_handle                        0 MB    0.00%    0.00%
jbd2_transaction_s                    0 MB    0.00%    0.00%
jbd2_revoke_record_s                  0 MB    0.00%    0.00%
jbd2_journal_head                     0 MB    0.00%    0.00%
jbd2_journal_handle                   0 MB    0.00%    0.00%
jbd2_inode                            0 MB    0.00%    0.00%
jbd2_4k                               0 MB    0.00%    0.00%
isofs_inode_cache                     0 MB    0.00%    0.00%
io                                    0 MB    0.00%    0.00%
inotify_event_private_data            0 MB    0.00%    0.00%
fstrm_item                            0 MB    0.00%    0.00%
fsnotify_event_holder                 0 MB    0.00%    0.00%
flow_cache                            0 MB    0.00%    0.00%
fat_inode_cache                       0 MB    0.00%    0.00%
fat_cache                             0 MB    0.00%    0.00%
fasync_cache                          0 MB    0.00%    0.00%
ext4_xattr                            0 MB    0.00%    0.00%
ext4_system_zone                      0 MB    0.00%    0.00%
ext4_prealloc_space                   0 MB    0.00%    0.00%
ext4_io_end                           0 MB    0.00%    0.00%
ext4_free_data                        0 MB    0.00%    0.00%
ext4_allocation_context               0 MB    0.00%    0.00%
ext3_xattr                            0 MB    0.00%    0.00%
ext2_xattr                            0 MB    0.00%    0.00%
ext2_inode_cache                      0 MB    0.00%    0.00%
dma-kmalloc-96                        0 MB    0.00%    0.00%
dma-kmalloc-8388608                   0 MB    0.00%    0.00%
dma-kmalloc-8192                      0 MB    0.00%    0.00%
dma-kmalloc-65536                     0 MB    0.00%    0.00%
dma-kmalloc-64                        0 MB    0.00%    0.00%
dma-kmalloc-524288                    0 MB    0.00%    0.00%
dma-kmalloc-512                       0 MB    0.00%    0.00%
dma-kmalloc-4194304                   0 MB    0.00%    0.00%
dma-kmalloc-4096                      0 MB    0.00%    0.00%
dma-kmalloc-32768                     0 MB    0.00%    0.00%
dma-kmalloc-32                        0 MB    0.00%    0.00%
dma-kmalloc-262144                    0 MB    0.00%    0.00%
dma-kmalloc-256                       0 MB    0.00%    0.00%
dma-kmalloc-2097152                   0 MB    0.00%    0.00%
dma-kmalloc-2048                      0 MB    0.00%    0.00%
dma-kmalloc-192                       0 MB    0.00%    0.00%
dma-kmalloc-16777216                  0 MB    0.00%    0.00%
dma-kmalloc-16384                     0 MB    0.00%    0.00%
dma-kmalloc-131072                    0 MB    0.00%    0.00%
dma-kmalloc-128                       0 MB    0.00%    0.00%
dma-kmalloc-1048576                   0 MB    0.00%    0.00%
dma-kmalloc-1024                      0 MB    0.00%    0.00%
dm_uevent                             0 MB    0.00%    0.00%
dm_rq_target_io                       0 MB    0.00%    0.00%
dio                                   0 MB    0.00%    0.00%
cifs_inode_cache                      0 MB    0.00%    0.00%
bsg_cmd                               0 MB    0.00%    0.00%
biovec-64                             0 MB    0.00%    0.00%
biovec-128                            0 MB    0.00%    0.00%
UDP-Lite                              0 MB    0.00%    0.00%
PING                                  0 MB    0.00%    0.00%

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
@ 2014-02-06 19:28                               ` Nishanth Aravamudan
  0 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-06 19:28 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, mpm, penberg, linux-mm, paulus, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li

[-- Attachment #1: Type: text/plain, Size: 8967 bytes --]

On 06.02.2014 [10:59:55 -0800], Nishanth Aravamudan wrote:
> On 06.02.2014 [17:04:18 +0900], Joonsoo Kim wrote:
> > On Wed, Feb 05, 2014 at 06:07:57PM -0800, Nishanth Aravamudan wrote:
> > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > 
> > > > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > > > the system showing the original problem, configured to have 15GB of
> > > > > memory.
> > > > > 
> > > > > With your patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:         8768192 kB
> > > > > Slab:            3882560 kB
> > > > > SReclaimable:     105408 kB
> > > > > SUnreclaim:      3777152 kB
> > > > > 
> > > > > With Anton's patch after boot:
> > > > > 
> > > > > MemTotal:       15604736 kB
> > > > > MemFree:        11195008 kB
> > > > > Slab:            1427968 kB
> > > > > SReclaimable:     109184 kB
> > > > > SUnreclaim:      1318784 kB
> > > > > 
> > > > > 
> > > > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > > > 
> > > > 
> > > > I don't think the goal of the discussion is to reduce the amount of slab 
> > > > allocated, but rather get the most local slab memory possible by use of 
> > > > kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> > > > which is probably cpu_to_node() for a cpu bound to a node without memory, 
> > > > my patch is allocating it on the most local node; Anton's patch is 
> > > > allocating it on whatever happened to be the cpu slab.
> > > > 
> > > > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > > > --- a/mm/slub.c
> > > > > > +++ b/mm/slub.c
> > > > > > @@ -2278,10 +2278,14 @@ redo:
> > > > > > 
> > > > > >  	if (unlikely(!node_match(page, node))) {
> > > > > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > > > > -		deactivate_slab(s, page, c->freelist);
> > > > > > -		c->page = NULL;
> > > > > > -		c->freelist = NULL;
> > > > > > -		goto new_slab;
> > > > > > +		if (unlikely(!node_present_pages(node)))
> > > > > > +			node = numa_mem_id();
> > > > > > +		if (!node_match(page, node)) {
> > > > > > +			deactivate_slab(s, page, c->freelist);
> > > > > > +			c->page = NULL;
> > > > > > +			c->freelist = NULL;
> > > > > > +			goto new_slab;
> > > > > > +		}
> > > > > 
> > > > > Semantically, and please correct me if I'm wrong, this patch is saying
> > > > > if we have a memoryless node, we expect the page's locality to be that
> > > > > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > > > > Just wanting to make sure I understand the intent.
> > > > > 
> > > > 
> > > > Yeah, the default policy should be to fallback to local memory if the node 
> > > > passed is memoryless.
> > > > 
> > > > > What I find odd is that there are only 2 nodes on this system, node 0
> > > > > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > > > > should be coming from node 1 (thus node_match() should always be true?)
> > > > > 
> > > > 
> > > > The nice thing about slub is its debugging ability, what is 
> > > > /sys/kernel/slab/cache/objects showing in comparison between the two 
> > > > patches?
> > > 
> > > Ok, I finally got around to writing a script that compares the objects
> > > output from both kernels.
> > > 
> > > log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> > > and Joonsoo's patch.
> > > 
> > > log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> > > and Anton's patch.
> > > 
> > > slab                           objects    objects   percent
> > >                                log1       log2      change
> > > -----------------------------------------------------------
> > > :t-0000104                     71190      85680      20.353982 %
> > > UDP                            4352       3392       22.058824 %
> > > inode_cache                    54302      41923      22.796582 %
> > > fscache_cookie_jar             3276       2457       25.000000 %
> > > :t-0000896                     438        292        33.333333 %
> > > :t-0000080                     310401     195323     37.073978 %
> > > ext4_inode_cache               335        201        40.000000 %
> > > :t-0000192                     89408      128898     44.168307 %
> > > :t-0000184                     151300     81880      45.882353 %
> > > :t-0000512                     49698      73648      48.191074 %
> > > :at-0000192                    242867     120948     50.199904 %
> > > xfs_inode                      34350      15221      55.688501 %
> > > :t-0016384                     11005      17257      56.810541 %
> > > proc_inode_cache               103868     34717      66.575846 %
> > > tw_sock_TCP                    768        256        66.666667 %
> > > :t-0004096                     15240      25672      68.451444 %
> > > nfs_inode_cache                1008       315        68.750000 %
> > > :t-0001024                     14528      24720      70.154185 %
> > > :t-0032768                     655        1312       100.305344%
> > > :t-0002048                     14242      30720      115.700042%
> > > :t-0000640                     1020       2550       150.000000%
> > > :t-0008192                     10005      27905      178.910545%
> > > 
> > > FWIW, the configuration of this LPAR has slightly changed. It is now configured
> > > for a maximum of 400 CPUs, of which 200 are present. The result is that even with
> > > Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
> > > script reports:
> > > 
> > > slab                                   mem     objs    slabs
> > >                                       used   active   active
> > > ------------------------------------------------------------
> > > kmalloc-512                        1182 MB    2.03%  100.00%
> > > kmalloc-192                        1182 MB    1.38%  100.00%
> > > kmalloc-16384                       966 MB   17.66%  100.00%
> > > kmalloc-4096                        353 MB   15.92%  100.00%
> > > kmalloc-8192                        259 MB   27.28%  100.00%
> > > kmalloc-32768                       207 MB    9.86%  100.00%
> > > 
> > > In comparison (log2 above):
> > > 
> > > slab                                   mem     objs    slabs
> > >                                       used   active   active
> > > ------------------------------------------------------------
> > > kmalloc-16384                       273 MB   98.76%  100.00%
> > > kmalloc-8192                        225 MB   98.67%  100.00%
> > > pgtable-2^11                        114 MB  100.00%  100.00%
> > > pgtable-2^12                        109 MB  100.00%  100.00%
> > > kmalloc-4096                        104 MB   98.59%  100.00%
> > > 
> > > I appreciate all the help so far, if anyone has any ideas how best to
> > > proceed further, or what they'd like debugged more, I'm happy to get
> > > this fixed. We're hitting this on a couple of different systems and I'd
> > > like to find a good resolution to the problem.
> > 
> > Hello,
> > 
> > I have no memoryless system, so, to debug it, I need your help. :)
> > First, please let me know node information on your system.
> 
> [    0.000000] Node 0 Memory:
> [    0.000000] Node 1 Memory: 0x0-0x200000000
> 
> [    0.000000] On node 0 totalpages: 0
> [    0.000000] On node 1 totalpages: 131072
> [    0.000000]   DMA zone: 112 pages used for memmap
> [    0.000000]   DMA zone: 0 pages reserved
> [    0.000000]   DMA zone: 131072 pages, LIFO batch:1
> 
> [    0.638391] Node 0 CPUs: 0-199
> [    0.638394] Node 1 CPUs:
> 
> Do you need anything else?
> 
> > I'm preparing 3 more patches which are nearly the same as the previous
> > patch, but take a slightly different approach. Could you test them on
> > your system?
> > I will send them soon.
> 
> Test results are in the attached tarball [1].
> 
> > And I think that the same problem exists if CONFIG_SLAB is enabled. Could you
> > confirm that?
> 
> I will test and let you know.

Ok, with your patches applied and CONFIG_SLAB enabled:

MemTotal:        8264640 kB
MemFree:         7119680 kB
Slab:             207232 kB
SReclaimable:      32896 kB
SUnreclaim:       174336 kB

For reference, same kernel with CONFIG_SLUB:

MemTotal:        8264640 kB
MemFree:         4264000 kB
Slab:            3065408 kB
SReclaimable:     104704 kB
SUnreclaim:      2960704 kB

So CONFIG_SLAB is much better in this case.

Without your patches (but still CONFIG_HAVE_MEMORYLESS_NODES, kthread
locality patch and two other unrelated bugfix patches):

3.13.0-slub:

MemTotal:        8264704 kB
MemFree:         4404288 kB
Slab:            2963648 kB
SReclaimable:     106816 kB
SUnreclaim:      2856832 kB

3.13.0-slab:

MemTotal:        8264640 kB
MemFree:         7263168 kB
Slab:             206144 kB
SReclaimable:      32576 kB
SUnreclaim:       173568 kB

In case it's helpful, I've attached /proc/slabinfo from both kernels.
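
For reference, numbers like those in the attached tables can be
recomputed from /proc/slabinfo with something like the minimal sketch
below. It assumes the "slabinfo - version: 2.x" field layout and a 4K
page size (ppc64 commonly runs with 64K pages, so adjust PAGE_SZ); it
is an illustration only, not the actual script used here.

#include <stdio.h>

/*
 * Each data line in /proc/slabinfo 2.x looks like:
 *   name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab>
 *     : tunables <...> : slabdata <active_slabs> <num_slabs> <shared>
 * "mem used" is approximated as num_slabs * pagesperslab * PAGE_SZ.
 */
#define PAGE_SZ 4096UL	/* assumption: 4K pages; often 64K on ppc64 */

int main(void)
{
	char line[512], name[64];
	unsigned long aobj, nobj, pper, aslab, nslab;
	FILE *f = fopen("/proc/slabinfo", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* the two header lines fail to match and are skipped */
		if (sscanf(line, "%63s %lu %lu %*u %*u %lu : tunables "
			   "%*u %*u %*u : slabdata %lu %lu %*u",
			   name, &aobj, &nobj, &pper,
			   &aslab, &nslab) != 6)
			continue;
		printf("%-36s %5lu MB  %6.2f%%  %6.2f%%\n", name,
		       nslab * pper * PAGE_SZ >> 20,
		       nobj ? 100.0 * aobj / nobj : 0.0,
		       nslab ? 100.0 * aslab / nslab : 0.0);
	}
	fclose(f);
	return 0;
}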

Thanks,
Nish

[-- Attachment #2: slabusage.3.13.SLAB --]
[-- Type: text/plain, Size: 13115 bytes --]

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
thread_info                          34 MB   96.33%  100.00%
kmalloc-1024                         22 MB   97.44%  100.00%
task_struct                          19 MB   95.15%  100.00%
kmalloc-16384                         9 MB   98.05%  100.00%
inode_cache                           8 MB   97.74%  100.00%
kmalloc-512                           7 MB   89.56%  100.00%
dentry                                7 MB   98.89%  100.00%
kmalloc-8192                          6 MB   98.64%  100.00%
proc_inode_cache                      6 MB   90.20%  100.00%
idr_layer_cache                       4 MB   94.76%  100.00%
sighand_cache                         4 MB   94.69%  100.00%
pgtable-2^12                          3 MB   72.58%  100.00%
xfs_inode                             3 MB   98.89%  100.00%
sysfs_dir_cache                       3 MB   98.29%  100.00%
radix_tree_node                       2 MB   97.19%  100.00%
kmalloc-32768                         2 MB   97.96%  100.00%
kmalloc-4096                          2 MB   97.68%  100.00%
filp                                  2 MB   20.71%  100.00%
signal_cache                          2 MB   72.35%  100.00%
pgtable-2^10                          2 MB   52.81%  100.00%
kmalloc-256                           2 MB   85.56%  100.00%
kmalloc-2048                          1 MB   84.95%  100.00%
shmem_inode_cache                     1 MB   89.59%  100.00%
dtl                                   1 MB   98.77%  100.00%
kmalloc-192                           1 MB   77.89%  100.00%
vm_area_struct                        1 MB   76.80%  100.00%
cred_jar                              1 MB   36.80%  100.00%
kmem_cache                            1 MB   97.69%  100.00%
kmalloc-65536                         0 MB  100.00%  100.00%
kmalloc-128                           0 MB   87.07%  100.00%
buffer_head                           0 MB   92.52%  100.00%
kmalloc-32                            0 MB   92.89%  100.00%
anon_vma_chain                        0 MB   47.46%  100.00%
sock_inode_cache                      0 MB   65.45%  100.00%
kmalloc-64                            0 MB   94.98%  100.00%
files_cache                           0 MB   60.85%  100.00%
names_cache                           0 MB   85.83%  100.00%
mm_struct                             0 MB   22.06%  100.00%
xfs_buf                               0 MB   91.50%  100.00%
UNIX                                  0 MB   37.90%  100.00%
task_delay_info                       0 MB   66.76%  100.00%
skbuff_head_cache                     0 MB   50.33%  100.00%
pid                                   0 MB   62.63%  100.00%
RAW                                   0 MB   92.59%  100.00%
kmalloc-96                            0 MB   63.71%  100.00%
anon_vma                              0 MB   52.25%  100.00%
xfs_ifork                             0 MB   88.60%  100.00%
biovec-256                            0 MB   75.56%  100.00%
TCP                                   0 MB   19.66%  100.00%
ftrace_event_field                    0 MB   63.17%  100.00%
fs_cache                              0 MB   24.30%  100.00%
file_lock_cache                       0 MB    5.24%  100.00%
eventpoll_epi                         0 MB   13.21%  100.00%
cifs_request                          0 MB   71.43%  100.00%
cfq_queue                             0 MB   26.90%  100.00%
blkdev_queue                          0 MB   48.39%  100.00%
UDP                                   0 MB   12.50%  100.00%
xfs_trans                             0 MB    4.33%  100.00%
xfs_log_ticket                        0 MB    3.45%  100.00%
xfs_log_item_desc                     0 MB    2.42%  100.00%
xfs_ioend                             0 MB   84.65%  100.00%
xfs_ili                               0 MB   66.20%  100.00%
xfs_buf_item                          0 MB    7.94%  100.00%
xfs_btree_cur                         0 MB    1.94%  100.00%
uid_cache                             0 MB    1.61%  100.00%
tcp_bind_bucket                       0 MB    2.18%  100.00%
taskstats                             0 MB    3.55%  100.00%
sigqueue                              0 MB    0.75%  100.00%
sgpool-8                              0 MB    1.59%  100.00%
sgpool-64                             0 MB    6.45%  100.00%
sgpool-32                             0 MB    3.17%  100.00%
sgpool-16                             0 MB    1.57%  100.00%
sgpool-128                            0 MB   13.33%  100.00%
sd_ext_cdb                            0 MB    0.11%  100.00%
scsi_sense_cache                      0 MB    0.60%  100.00%
scsi_cmd_cache                        0 MB    1.19%  100.00%
rpc_tasks                             0 MB    3.17%  100.00%
rpc_inode_cache                       0 MB   31.68%  100.00%
rpc_buffers                           0 MB   25.81%  100.00%
revoke_table                          0 MB    0.12%  100.00%
pool_workqueue                        0 MB    4.37%  100.00%
numa_policy                           0 MB   46.75%  100.00%
nsproxy                               0 MB    0.16%  100.00%
nfs_write_data                        0 MB   50.79%  100.00%
nfs_inode_cache                       0 MB   27.69%  100.00%
nfs_commit_data                       0 MB    4.76%  100.00%
nf_conntrack_c000000000cc9900         0 MB   45.22%  100.00%
mqueue_inode_cache                    0 MB    1.39%  100.00%
mnt_cache                             0 MB   53.57%  100.00%
key_jar                               0 MB    5.56%  100.00%
jbd2_revoke_table_s                   0 MB    0.06%  100.00%
ip_fib_trie                           0 MB    0.73%  100.00%
ip_fib_alias                          0 MB    0.71%  100.00%
ip_dst_cache                          0 MB   30.16%  100.00%
inotify_inode_mark                    0 MB   17.23%  100.00%
inet_peer_cache                       0 MB    3.97%  100.00%
hugetlbfs_inode_cache                 0 MB    2.59%  100.00%
ftrace_event_file                     0 MB   92.58%  100.00%
fsnotify_event                        0 MB    0.18%  100.00%
ext4_inode_cache                      0 MB    4.35%  100.00%
ext4_groupinfo_4k                     0 MB    8.55%  100.00%
ext4_extent_status                    0 MB    0.07%  100.00%
ext3_inode_cache                      0 MB    4.76%  100.00%
eventpoll_pwq                         0 MB   15.20%  100.00%
dnotify_struct                        0 MB    0.60%  100.00%
dnotify_mark                          0 MB    2.08%  100.00%
dm_io                                 0 MB    2.28%  100.00%
cifs_small_rq                         0 MB   23.62%  100.00%
cifs_mpx_ids                          0 MB    0.60%  100.00%
cfq_io_cq                             0 MB   26.42%  100.00%
blkdev_requests                       0 MB   10.56%  100.00%
blkdev_ioc                            0 MB   19.31%  100.00%
biovec-16                             0 MB    1.19%  100.00%
bio-1                                 0 MB   13.49%  100.00%
bio-0                                 0 MB    1.59%  100.00%
bdev_cache                            0 MB   52.78%  100.00%
xfs_mru_cache_elem                    0 MB    0.00%    0.00%
xfs_icr                               0 MB    0.00%    0.00%
xfs_efi_item                          0 MB    0.00%    0.00%
xfs_efd_item                          0 MB    0.00%    0.00%
xfs_da_state                          0 MB    0.00%    0.00%
xfs_bmap_free_item                    0 MB    0.00%    0.00%
xfrm_dst_cache                        0 MB    0.00%    0.00%
tw_sock_TCP                           0 MB    0.00%    0.00%
skbuff_fclone_cache                   0 MB    0.00%    0.00%
shared_policy_node                    0 MB    0.00%    0.00%
secpath_cache                         0 MB    0.00%    0.00%
scsi_data_buffer                      0 MB    0.00%    0.00%
revoke_record                         0 MB    0.00%    0.00%
request_sock_TCP                      0 MB    0.00%    0.00%
reiser_inode_cache                    0 MB    0.00%    0.00%
posix_timers_cache                    0 MB    0.00%    0.00%
pid_namespace                         0 MB    0.00%    0.00%
nfsd_drc                              0 MB    0.00%    0.00%
nfsd4_stateids                        0 MB    0.00%    0.00%
nfsd4_openowners                      0 MB    0.00%    0.00%
nfsd4_lockowners                      0 MB    0.00%    0.00%
nfsd4_files                           0 MB    0.00%    0.00%
nfsd4_delegations                     0 MB    0.00%    0.00%
nfs_read_data                         0 MB    0.00%    0.00%
nfs_page                              0 MB    0.00%    0.00%
nfs_direct_cache                      0 MB    0.00%    0.00%
nf_conntrack_expect                   0 MB    0.00%    0.00%
net_namespace                         0 MB    0.00%    0.00%
kmalloc-8388608                       0 MB    0.00%    0.00%
kmalloc-524288                        0 MB    0.00%    0.00%
kmalloc-4194304                       0 MB    0.00%    0.00%
kmalloc-262144                        0 MB    0.00%    0.00%
kmalloc-2097152                       0 MB    0.00%    0.00%
kmalloc-16777216                      0 MB    0.00%    0.00%
kmalloc-131072                        0 MB    0.00%    0.00%
kmalloc-1048576                       0 MB    0.00%    0.00%
kioctx                                0 MB    0.00%    0.00%
kiocb                                 0 MB    0.00%    0.00%
kcopyd_job                            0 MB    0.00%    0.00%
journal_head                          0 MB    0.00%    0.00%
journal_handle                        0 MB    0.00%    0.00%
jbd2_transaction_s                    0 MB    0.00%    0.00%
jbd2_revoke_record_s                  0 MB    0.00%    0.00%
jbd2_journal_head                     0 MB    0.00%    0.00%
jbd2_journal_handle                   0 MB    0.00%    0.00%
jbd2_inode                            0 MB    0.00%    0.00%
jbd2_4k                               0 MB    0.00%    0.00%
isofs_inode_cache                     0 MB    0.00%    0.00%
io                                    0 MB    0.00%    0.00%
inotify_event_private_data            0 MB    0.00%    0.00%
fstrm_item                            0 MB    0.00%    0.00%
fsnotify_event_holder                 0 MB    0.00%    0.00%
flow_cache                            0 MB    0.00%    0.00%
fat_inode_cache                       0 MB    0.00%    0.00%
fat_cache                             0 MB    0.00%    0.00%
fasync_cache                          0 MB    0.00%    0.00%
ext4_xattr                            0 MB    0.00%    0.00%
ext4_system_zone                      0 MB    0.00%    0.00%
ext4_prealloc_space                   0 MB    0.00%    0.00%
ext4_io_end                           0 MB    0.00%    0.00%
ext4_free_data                        0 MB    0.00%    0.00%
ext4_allocation_context               0 MB    0.00%    0.00%
ext3_xattr                            0 MB    0.00%    0.00%
ext2_xattr                            0 MB    0.00%    0.00%
ext2_inode_cache                      0 MB    0.00%    0.00%
dma-kmalloc-96                        0 MB    0.00%    0.00%
dma-kmalloc-8388608                   0 MB    0.00%    0.00%
dma-kmalloc-8192                      0 MB    0.00%    0.00%
dma-kmalloc-65536                     0 MB    0.00%    0.00%
dma-kmalloc-64                        0 MB    0.00%    0.00%
dma-kmalloc-524288                    0 MB    0.00%    0.00%
dma-kmalloc-512                       0 MB    0.00%    0.00%
dma-kmalloc-4194304                   0 MB    0.00%    0.00%
dma-kmalloc-4096                      0 MB    0.00%    0.00%
dma-kmalloc-32768                     0 MB    0.00%    0.00%
dma-kmalloc-32                        0 MB    0.00%    0.00%
dma-kmalloc-262144                    0 MB    0.00%    0.00%
dma-kmalloc-256                       0 MB    0.00%    0.00%
dma-kmalloc-2097152                   0 MB    0.00%    0.00%
dma-kmalloc-2048                      0 MB    0.00%    0.00%
dma-kmalloc-192                       0 MB    0.00%    0.00%
dma-kmalloc-16777216                  0 MB    0.00%    0.00%
dma-kmalloc-16384                     0 MB    0.00%    0.00%
dma-kmalloc-131072                    0 MB    0.00%    0.00%
dma-kmalloc-128                       0 MB    0.00%    0.00%
dma-kmalloc-1048576                   0 MB    0.00%    0.00%
dma-kmalloc-1024                      0 MB    0.00%    0.00%
dm_uevent                             0 MB    0.00%    0.00%
dm_rq_target_io                       0 MB    0.00%    0.00%
dio                                   0 MB    0.00%    0.00%
cifs_inode_cache                      0 MB    0.00%    0.00%
bsg_cmd                               0 MB    0.00%    0.00%
biovec-64                             0 MB    0.00%    0.00%
biovec-128                            0 MB    0.00%    0.00%
UDP-Lite                              0 MB    0.00%    0.00%
PING                                  0 MB    0.00%    0.00%

[-- Attachment #3: slabusage.3.13.SLUB --]
[-- Type: text/plain, Size: 7076 bytes --]

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                      1018 MB   14.09%  100.00%
task_struct                         704 MB   17.20%  100.00%
pgtable-2^12                        110 MB  100.00%  100.00%
kmalloc-8192                        109 MB   49.21%  100.00%
pgtable-2^10                        105 MB  100.00%  100.00%
kmalloc-65536                        92 MB  100.00%  100.00%
kmalloc-512                          83 MB   16.68%  100.00%
kmalloc-128                          75 MB   17.55%  100.00%
kmalloc-4096                         52 MB   97.30%  100.00%
kmalloc-16                           38 MB   24.78%  100.00%
kmalloc-256                          33 MB   99.09%  100.00%
kmalloc-1024                         27 MB   60.45%  100.00%
sighand_cache                        27 MB  100.00%  100.00%
idr_layer_cache                      25 MB  100.00%  100.00%
kmalloc-2048                         25 MB   97.59%  100.00%
dentry                               23 MB  100.00%  100.00%
inode_cache                          20 MB  100.00%  100.00%
proc_inode_cache                     19 MB  100.00%  100.00%
sysfs_dir_cache                      16 MB  100.00%  100.00%
vm_area_struct                       14 MB  100.00%  100.00%
kmalloc-64                           14 MB   97.79%  100.00%
kmalloc-192                          13 MB   97.60%  100.00%
kmalloc-32                           12 MB   97.56%  100.00%
anon_vma                             12 MB  100.00%  100.00%
mm_struct                            12 MB  100.00%  100.00%
sigqueue                             12 MB  100.00%  100.00%
files_cache                          12 MB  100.00%  100.00%
cfq_queue                            11 MB  100.00%  100.00%
radix_tree_node                      11 MB  100.00%  100.00%
kmalloc-96                           10 MB   97.06%  100.00%
blkdev_requests                      10 MB  100.00%  100.00%
xfs_inode                             9 MB  100.00%  100.00%
shmem_inode_cache                     9 MB  100.00%  100.00%
ext4_system_zone                      9 MB  100.00%  100.00%
sock_inode_cache                      9 MB  100.00%  100.00%
RAW                                   8 MB  100.00%  100.00%
kmalloc-8                             8 MB  100.00%  100.00%
kmalloc-32768                         8 MB  100.00%  100.00%
blkdev_ioc                            7 MB  100.00%  100.00%
buffer_head                           6 MB  100.00%  100.00%
xfs_da_state                          6 MB  100.00%  100.00%
mnt_cache                             6 MB  100.00%  100.00%
numa_policy                           6 MB  100.00%  100.00%
dnotify_mark                          4 MB  100.00%  100.00%
TCP                                   3 MB  100.00%  100.00%
cifs_request                          3 MB  100.00%  100.00%
UDP                                   3 MB  100.00%  100.00%
xfs_ili                               3 MB  100.00%  100.00%
xfs_btree_cur                         3 MB  100.00%  100.00%
nf_conntrack_c000000000cb5480         2 MB  100.00%  100.00%
fsnotify_event_holder                 1 MB  100.00%  100.00%
dm_rq_target_io                       1 MB  100.00%  100.00%
bdev_cache                            1 MB  100.00%  100.00%
kmem_cache                            1 MB   89.09%  100.00%
blkdev_queue                          0 MB  100.00%  100.00%
dio                                   0 MB  100.00%  100.00%
taskstats                             0 MB  100.00%  100.00%
kmem_cache_node                       0 MB  100.00%  100.00%
shared_policy_node                    0 MB  100.00%  100.00%
rpc_inode_cache                       0 MB  100.00%  100.00%
nfs_inode_cache                       0 MB  100.00%  100.00%
revoke_table                          0 MB  100.00%  100.00%
ip_fib_trie                           0 MB  100.00%  100.00%
ext4_inode_cache                      0 MB  100.00%  100.00%
hugetlbfs_inode_cache                 0 MB  100.00%  100.00%
ext3_inode_cache                      0 MB  100.00%  100.00%
tw_sock_TCP                           0 MB  100.00%  100.00%
mqueue_inode_cache                    0 MB  100.00%  100.00%
ext4_extent_status                    0 MB  100.00%  100.00%
ext4_allocation_context               0 MB  100.00%  100.00%
xfs_icr                               0 MB    0.00%    0.00%
revoke_record                         0 MB    0.00%    0.00%
reiser_inode_cache                    0 MB    0.00%    0.00%
posix_timers_cache                    0 MB    0.00%    0.00%
pid_namespace                         0 MB    0.00%    0.00%
nfsd4_openowners                      0 MB    0.00%    0.00%
nfsd4_delegations                     0 MB    0.00%    0.00%
nfs_direct_cache                      0 MB    0.00%    0.00%
net_namespace                         0 MB    0.00%    0.00%
kmalloc-131072                        0 MB    0.00%    0.00%
kcopyd_job                            0 MB    0.00%    0.00%
journal_head                          0 MB    0.00%    0.00%
journal_handle                        0 MB    0.00%    0.00%
jbd2_transaction_s                    0 MB    0.00%    0.00%
jbd2_journal_handle                   0 MB    0.00%    0.00%
isofs_inode_cache                     0 MB    0.00%    0.00%
fat_inode_cache                       0 MB    0.00%    0.00%
fat_cache                             0 MB    0.00%    0.00%
ext4_io_end                           0 MB    0.00%    0.00%
ext4_free_data                        0 MB    0.00%    0.00%
ext3_xattr                            0 MB    0.00%    0.00%
ext2_inode_cache                      0 MB    0.00%    0.00%
dma-kmalloc-96                        0 MB    0.00%    0.00%
dma-kmalloc-8192                      0 MB    0.00%    0.00%
dma-kmalloc-8                         0 MB    0.00%    0.00%
dma-kmalloc-65536                     0 MB    0.00%    0.00%
dma-kmalloc-64                        0 MB    0.00%    0.00%
dma-kmalloc-512                       0 MB    0.00%    0.00%
dma-kmalloc-4096                      0 MB    0.00%    0.00%
dma-kmalloc-32768                     0 MB    0.00%    0.00%
dma-kmalloc-32                        0 MB    0.00%    0.00%
dma-kmalloc-256                       0 MB    0.00%    0.00%
dma-kmalloc-2048                      0 MB    0.00%    0.00%
dma-kmalloc-192                       0 MB    0.00%    0.00%
dma-kmalloc-16384                     0 MB    0.00%    0.00%
dma-kmalloc-16                        0 MB    0.00%    0.00%
dma-kmalloc-131072                    0 MB    0.00%    0.00%
dma-kmalloc-128                       0 MB    0.00%    0.00%
dma-kmalloc-1024                      0 MB    0.00%    0.00%
dm_uevent                             0 MB    0.00%    0.00%
cifs_inode_cache                      0 MB    0.00%    0.00%
bsg_cmd                               0 MB    0.00%    0.00%
UDP-Lite                              0 MB    0.00%    0.00%

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-06 10:29                                 ` Joonsoo Kim
@ 2014-02-06 20:52                                   ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-02-06 20:52 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Nishanth Aravamudan, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li

On Thu, 6 Feb 2014, Joonsoo Kim wrote:

> From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Date: Thu, 6 Feb 2014 17:07:05 +0900
> Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
> determining the
>  fallback node
> 
> We need to determine the fallback node in the slub allocator if the
> allocation target node is a memoryless node. Without it, SLUB wrongly
> selects a node which has no memory and cannot use a partial slab,
> because of the node mismatch. The introduced function, node_numa_mem(X),
> returns a node Y with memory that has the nearest distance to X. If X
> is a memoryless node, it returns the nearest node with memory; if X is
> a normal node, it returns X itself.
> 
> We will use this function in a following patch to determine the
> fallback node.
> 

I like the approach and it may fix the problem today, but it may not be
sufficient in the future: nodes may not only be memoryless, they may
also be cpuless. It is possible for a node to have only I/O, networking,
or storage devices, and the ACPI specification lets us define affinity
for them that is remote from every cpu and/or memory.

It seems like a better approach would be to do this when a node is brought 
online and determine the fallback node based not on the zonelists as you 
do here but rather on locality (such as through a SLIT if provided, see 
node_distance()).

Also, the names aren't very descriptive: {get,set}_numa_mem() doesn't make 
a lot of sense in generic code.  I'd suggest something like 
node_to_mem_node().
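
Roughly, a node-online-time variant could look like the sketch below
(illustration only, using the node_to_mem_node() naming; this is not
the posted patch, and the update hook name is an assumption):

/*
 * Record, for each node, the nearest node that actually has memory,
 * recomputed from node_distance() as nodes come online. A memoryless
 * node maps to its nearest memory-bearing neighbour; a normal node
 * maps to itself.
 */
static int _node_to_mem_node[MAX_NUMNODES];

static void update_node_to_mem_node(int node)
{
	int n, best = node, best_dist = INT_MAX;

	if (node_state(node, N_MEMORY)) {
		_node_to_mem_node[node] = node;
		return;
	}
	for_each_node_state(n, N_MEMORY) {
		if (node_distance(node, n) < best_dist) {
			best_dist = node_distance(node, n);
			best = n;
		}
	}
	_node_to_mem_node[node] = best;
}

static inline int node_to_mem_node(int node)
{
	return _node_to_mem_node[node];
}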


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
  2014-02-06 17:30                               ` Christoph Lameter
@ 2014-02-07  5:41                                 ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-07  5:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nishanth Aravamudan, David Rientjes, Han Pingtian, penberg,
	linux-mm, paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li

On Thu, Feb 06, 2014 at 11:30:20AM -0600, Christoph Lameter wrote:
> On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index cc1f995..c851f82 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -1700,6 +1700,14 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
> >  	void *object;
> >  	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> >
> > +	if (node == NUMA_NO_NODE)
> > +		searchnode = numa_mem_id();
> > +	else {
> > +		searchnode = node;
> > +		if (!node_present_pages(node))
> 
> This check would need to be something that checks for other contingencies
> in the page allocator as well. A simple solution would be to actually run
> a GFP_THISNODE alloc to see if you can grab a page from the proper node.
> If that fails, then fall back. See how fallback_alloc() does it in slab.
> 

Hello, Christoph.

This !node_present_pages() check ensures that an allocation on this node
cannot succeed, so we can directly use numa_mem_id() here.
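
For comparison, the probe you describe would look roughly like the
sketch below (the exact gfp handling is an assumption on my side;
fallback_alloc() in mm/slab.c is the model):

/*
 * Try the requested node directly and let the caller fall back only
 * if the page allocator really cannot satisfy it there.
 */
static struct page *probe_node(gfp_t flags, int node, unsigned int order)
{
	flags = (flags | __GFP_THISNODE | __GFP_NOWARN) & ~__GFP_WAIT;
	return alloc_pages_node(node, flags, order);
}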

> > +			searchnode = get_numa_mem(node);
> > +	}
> 
> > @@ -2277,11 +2285,18 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
> >  redo:
> >
> >  	if (unlikely(!node_match(page, node))) {
> > -		stat(s, ALLOC_NODE_MISMATCH);
> > -		deactivate_slab(s, page, c->freelist);
> > -		c->page = NULL;
> > -		c->freelist = NULL;
> > -		goto new_slab;
> > +		int searchnode = node;
> > +
> > +		if (node != NUMA_NO_NODE && !node_present_pages(node))
> 
> Same issue here. I would suggest not deactivating the slab and first
> checking if the node has no pages. If so, then just take an object from
> the current cpu slab. If that is not available, do an allocation from
> the indicated node and take whatever the page allocator gives you.

What I do here is not deactivate the slab. I first check whether the node
has no pages, but then I do not simply take an object from the current cpu
slab. Instead, I check whether the current cpu slab comes from the proper
node, as returned by the newly introduced get_numa_mem(). I think that this
approach is better than just taking an object regardless of which node was
requested.
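
To make the intended flow concrete, here is a sketch reconstructed from
the description above (not the exact posted hunk):

	if (unlikely(!node_match(page, node))) {
		int searchnode = node;

		/* a memoryless node can never match; aim at its fallback */
		if (node != NUMA_NO_NODE && !node_present_pages(node))
			searchnode = get_numa_mem(node);

		/* only give up the cpu slab on a real mismatch */
		if (!node_match(page, searchnode)) {
			stat(s, ALLOC_NODE_MISMATCH);
			deactivate_slab(s, page, c->freelist);
			c->page = NULL;
			c->freelist = NULL;
			goto new_slab;
		}
	}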

Thanks.


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-06 19:11                                   ` Nishanth Aravamudan
@ 2014-02-07  5:42                                     ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-07  5:42 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li

On Thu, Feb 06, 2014 at 11:11:31AM -0800, Nishanth Aravamudan wrote:
> > diff --git a/include/linux/topology.h b/include/linux/topology.h
> > index 12ae6ce..66b19b8 100644
> > --- a/include/linux/topology.h
> > +++ b/include/linux/topology.h
> > @@ -233,11 +233,20 @@ static inline int numa_node_id(void)
> >   * Use the accessor functions set_numa_mem(), numa_mem_id() and cpu_to_mem().
> >   */
> >  DECLARE_PER_CPU(int, _numa_mem_);
> > +int _node_numa_mem_[MAX_NUMNODES];
> 
> Should be static, I think?

Yes, will update it.

Thanks.


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-06 20:52                                   ` David Rientjes
@ 2014-02-07  5:48                                     ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-07  5:48 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nishanth Aravamudan, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li

On Thu, Feb 06, 2014 at 12:52:11PM -0800, David Rientjes wrote:
> On Thu, 6 Feb 2014, Joonsoo Kim wrote:
> 
> > From bf691e7eb07f966e3aed251eaeb18f229ee32d1f Mon Sep 17 00:00:00 2001
> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > Date: Thu, 6 Feb 2014 17:07:05 +0900
> > Subject: [RFC PATCH 2/3 v2] topology: support node_numa_mem() for
> > determining the
> >  fallback node
> > 
> > We need to determine the fallback node in the slub allocator if the
> > allocation target node is a memoryless node. Without it, SLUB wrongly
> > selects a node which has no memory and cannot use a partial slab,
> > because of the node mismatch. The introduced function, node_numa_mem(X),
> > returns a node Y with memory that has the nearest distance to X. If X
> > is a memoryless node, it returns the nearest node with memory; if X is
> > a normal node, it returns X itself.
> > 
> > We will use this function in a following patch to determine the
> > fallback node.
> > 
> 
> I like the approach and it may fix the problem today, but it may not be
> sufficient in the future: nodes may not only be memoryless, they may
> also be cpuless. It is possible for a node to have only I/O, networking,
> or storage devices, and the ACPI specification lets us define affinity
> for them that is remote from every cpu and/or memory.
> 
> It seems like a better approach would be to do this when a node is brought 
> online and determine the fallback node based not on the zonelists as you 
> do here but rather on locality (such as through a SLIT if provided, see 
> node_distance()).

Hmm...
I guess that the zonelist is based on locality. The zonelist is generated
using node_distance(), so I think that it reflects locality. But I'm not
an expert on NUMA, so please let me know what I am missing here :)
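
For what it's worth, the fallback order is built by repeatedly picking
the nearest not-yet-used node, roughly like this simplification of
find_next_best_node() in mm/page_alloc.c (a sketch, not the actual body):

/*
 * Pick the nearest online node not yet placed in the fallback list,
 * so the resulting zonelist order does encode locality.
 */
static int next_best_node(int node, nodemask_t *used)
{
	int n, best = NUMA_NO_NODE, best_dist = INT_MAX;

	for_each_online_node(n) {
		if (node_isset(n, *used))
			continue;
		if (node_distance(node, n) < best_dist) {
			best_dist = node_distance(node, n);
			best = n;
		}
	}
	if (best != NUMA_NO_NODE)
		node_set(best, *used);
	return best;
}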

> Also, the names aren't very descriptive: {get,set}_numa_mem() doesn't make 
> a lot of sense in generic code.  I'd suggest something like 
> node_to_mem_node().

That's much better!
If this patch turns out to be needed, I will update it.

Thanks.


^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory
  2014-02-06 19:28                               ` Nishanth Aravamudan
@ 2014-02-07  8:03                                 ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-07  8:03 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li

On Thu, Feb 06, 2014 at 11:28:12AM -0800, Nishanth Aravamudan wrote:
> On 06.02.2014 [10:59:55 -0800], Nishanth Aravamudan wrote:
> > On 06.02.2014 [17:04:18 +0900], Joonsoo Kim wrote:
> > > On Wed, Feb 05, 2014 at 06:07:57PM -0800, Nishanth Aravamudan wrote:
> > > > On 24.01.2014 [16:25:58 -0800], David Rientjes wrote:
> > > > > On Fri, 24 Jan 2014, Nishanth Aravamudan wrote:
> > > > > 
> > > > > > Thank you for clarifying and providing  a test patch. I ran with this on
> > > > > > the system showing the original problem, configured to have 15GB of
> > > > > > memory.
> > > > > > 
> > > > > > With your patch after boot:
> > > > > > 
> > > > > > MemTotal:       15604736 kB
> > > > > > MemFree:         8768192 kB
> > > > > > Slab:            3882560 kB
> > > > > > SReclaimable:     105408 kB
> > > > > > SUnreclaim:      3777152 kB
> > > > > > 
> > > > > > With Anton's patch after boot:
> > > > > > 
> > > > > > MemTotal:       15604736 kB
> > > > > > MemFree:        11195008 kB
> > > > > > Slab:            1427968 kB
> > > > > > SReclaimable:     109184 kB
> > > > > > SUnreclaim:      1318784 kB
> > > > > > 
> > > > > > 
> > > > > > I know that's fairly unscientific, but the numbers are reproducible. 
> > > > > > 
> > > > > 
> > > > > I don't think the goal of the discussion is to reduce the amount of slab 
> > > > > allocated, but rather get the most local slab memory possible by use of 
> > > > > kmalloc_node().  When a memoryless node is being passed to kmalloc_node(), 
> > > > > which is probably cpu_to_node() for a cpu bound to a node without memory, 
> > > > > my patch is allocating it on the most local node; Anton's patch is 
> > > > > allocating it on whatever happened to be the cpu slab.
> > > > > 
> > > > > > > diff --git a/mm/slub.c b/mm/slub.c
> > > > > > > --- a/mm/slub.c
> > > > > > > +++ b/mm/slub.c
> > > > > > > @@ -2278,10 +2278,14 @@ redo:
> > > > > > > 
> > > > > > >  	if (unlikely(!node_match(page, node))) {
> > > > > > >  		stat(s, ALLOC_NODE_MISMATCH);
> > > > > > > -		deactivate_slab(s, page, c->freelist);
> > > > > > > -		c->page = NULL;
> > > > > > > -		c->freelist = NULL;
> > > > > > > -		goto new_slab;
> > > > > > > +		if (unlikely(!node_present_pages(node)))
> > > > > > > +			node = numa_mem_id();
> > > > > > > +		if (!node_match(page, node)) {
> > > > > > > +			deactivate_slab(s, page, c->freelist);
> > > > > > > +			c->page = NULL;
> > > > > > > +			c->freelist = NULL;
> > > > > > > +			goto new_slab;
> > > > > > > +		}
> > > > > > 
> > > > > > Semantically, and please correct me if I'm wrong, this patch is saying
> > > > > > if we have a memoryless node, we expect the page's locality to be that
> > > > > > of numa_mem_id(), and we still deactivate the slab if that isn't true.
> > > > > > Just wanting to make sure I understand the intent.
> > > > > > 
> > > > > 
> > > > > Yeah, the default policy should be to fallback to local memory if the node 
> > > > > passed is memoryless.
> > > > > 
> > > > > > What I find odd is that there are only 2 nodes on this system, node 0
> > > > > > (empty) and node 1. So won't numa_mem_id() always be 1? And every page
> > > > > > should be coming from node 1 (thus node_match() should always be true?)
> > > > > > 
> > > > > 
> > > > > The nice thing about slub is its debugging ability, what is 
> > > > > /sys/kernel/slab/cache/objects showing in comparison between the two 
> > > > > patches?
> > > > 
> > > > Ok, I finally got around to writing a script that compares the objects
> > > > output from both kernels.
> > > > 
> > > > log1 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> > > > and Joonsoo's patch.
> > > > 
> > > > log2 is with CONFIG_HAVE_MEMORYLESS_NODES on, my kthread locality patch
> > > > and Anton's patch.
> > > > 
> > > > slab                           objects    objects   percent
> > > >                                log1       log2      change
> > > > -----------------------------------------------------------
> > > > :t-0000104                     71190      85680      20.353982 %
> > > > UDP                            4352       3392       22.058824 %
> > > > inode_cache                    54302      41923      22.796582 %
> > > > fscache_cookie_jar             3276       2457       25.000000 %
> > > > :t-0000896                     438        292        33.333333 %
> > > > :t-0000080                     310401     195323     37.073978 %
> > > > ext4_inode_cache               335        201        40.000000 %
> > > > :t-0000192                     89408      128898     44.168307 %
> > > > :t-0000184                     151300     81880      45.882353 %
> > > > :t-0000512                     49698      73648      48.191074 %
> > > > :at-0000192                    242867     120948     50.199904 %
> > > > xfs_inode                      34350      15221      55.688501 %
> > > > :t-0016384                     11005      17257      56.810541 %
> > > > proc_inode_cache               103868     34717      66.575846 %
> > > > tw_sock_TCP                    768        256        66.666667 %
> > > > :t-0004096                     15240      25672      68.451444 %
> > > > nfs_inode_cache                1008       315        68.750000 %
> > > > :t-0001024                     14528      24720      70.154185 %
> > > > :t-0032768                     655        1312       100.305344%
> > > > :t-0002048                     14242      30720      115.700042%
> > > > :t-0000640                     1020       2550       150.000000%
> > > > :t-0008192                     10005      27905      178.910545%
> > > > 
> > > > FWIW, the configuration of this LPAR has slightly changed. It is now configured
> > > > for a maximum of 400 CPUs, of which 200 are present. The result is that even with
> > > > Joonsoo's patch (log1 above), we OOM pretty easily and Anton's slab usage
> > > > script reports:
> > > > 
> > > > slab                                   mem     objs    slabs
> > > >                                       used   active   active
> > > > ------------------------------------------------------------
> > > > kmalloc-512                        1182 MB    2.03%  100.00%
> > > > kmalloc-192                        1182 MB    1.38%  100.00%
> > > > kmalloc-16384                       966 MB   17.66%  100.00%
> > > > kmalloc-4096                        353 MB   15.92%  100.00%
> > > > kmalloc-8192                        259 MB   27.28%  100.00%
> > > > kmalloc-32768                       207 MB    9.86%  100.00%
> > > > 
> > > > In comparison (log2 above):
> > > > 
> > > > slab                                   mem     objs    slabs
> > > >                                       used   active   active
> > > > ------------------------------------------------------------
> > > > kmalloc-16384                       273 MB   98.76%  100.00%
> > > > kmalloc-8192                        225 MB   98.67%  100.00%
> > > > pgtable-2^11                        114 MB  100.00%  100.00%
> > > > pgtable-2^12                        109 MB  100.00%  100.00%
> > > > kmalloc-4096                        104 MB   98.59%  100.00%
> > > > 
> > > > I appreciate all the help so far. If anyone has any ideas on how best to
> > > > proceed further, or what they'd like debugged more, I'm happy to get
> > > > this fixed. We're hitting this on a couple of different systems and I'd
> > > > like to find a good resolution to the problem.
> > > 
> > > Hello,
> > > 
> > > I have no memoryless system, so, to debug it, I need your help. :)
> > > First, please let me know the node information on your system.
> > 
> > [    0.000000] Node 0 Memory:
> > [    0.000000] Node 1 Memory: 0x0-0x200000000
> > 
> > [    0.000000] On node 0 totalpages: 0
> > [    0.000000] On node 1 totalpages: 131072
> > [    0.000000]   DMA zone: 112 pages used for memmap
> > [    0.000000]   DMA zone: 0 pages reserved
> > [    0.000000]   DMA zone: 131072 pages, LIFO batch:1
> > 
> > [    0.638391] Node 0 CPUs: 0-199
> > [    0.638394] Node 1 CPUs:
> > 
> > Do you need anything else?
> > 
> > > I'm preparing 3 more patches which are nearly the same as the previous patch,
> > > but take a slightly different approach. Could you test them on your system?
> > > I will send them soon.
> > 
> > Test results are in the attached tarball [1].
> > 
> > > And I think that the same problem exists if CONFIG_SLAB is enabled. Could you
> > > confirm that?
> > 
> > I will test and let you know.
> 
> Ok, with your patches applied and CONFIG_SLAB enabled:
> 
> MemTotal:        8264640 kB
> MemFree:         7119680 kB
> Slab:             207232 kB
> SReclaimable:      32896 kB
> SUnreclaim:       174336 kB
> 
> For reference, same kernel with CONFIG_SLUB:
> 
> MemTotal:        8264640 kB
> MemFree:         4264000 kB
> Slab:            3065408 kB
> SReclaimable:     104704 kB
> SUnreclaim:      2960704 kB
> 


Hello,

First of all, thanks for testing!

My patch only affects CONFIG_SLUB. The request to test with CONFIG_SLAB was just
for reference. It seems that my patches don't have any effect in your case.
Could you check that numa_mem_id() and get_numa_mem() return correct values?
I think that numa_mem_id() for all cpus and get_numa_mem() for all nodes
should return 1 on your system.
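
Something like the following could dump all of them at boot (untested
sketch; get_numa_mem() assumes my RFC helpers are applied, and the
initcall is arbitrary):

/*
 * Untested debug sketch: print the memory fallback for every online
 * cpu and node so the mappings can be verified on this system.
 */
static int __init dump_numa_mem(void)
{
	int cpu, node;

	for_each_online_cpu(cpu)
		pr_info("cpu %d: node %d, mem %d\n",
			cpu, cpu_to_node(cpu), cpu_to_mem(cpu));

	for_each_online_node(node)
		pr_info("node %d: numa_mem %d\n", node, get_numa_mem(node));

	return 0;
}
late_initcall(dump_numa_mem);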

I will investigate further on my side.

Thanks!

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
  2014-02-07  5:41                                 ` Joonsoo Kim
@ 2014-02-07 17:49                                   ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-07 17:49 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Nishanth Aravamudan, David Rientjes, Han Pingtian, penberg,
	linux-mm, paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li

On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> > This check would need to be something that checks for other contingencies
> > in the page allocator as well. A simple solution would be to actually run
> > a GFP_THISNODE alloc to see if you can grab a page from the proper node.
> > If that fails then fall back. See how fallback_alloc() does it in slab.
> >
>
> Hello, Christoph.
>
> > This !node_present_pages() ensures that allocation on this node cannot succeed.
> > So we can directly use numa_mem_id() here.

Yes of course we can use numa_mem_id().

But the check is only for not having any memory at all on a node. There
are other reasons for allocations to fail on a certain node. The node could
have memory that cannot be reclaimed, is all dirty, is beyond certain
thresholds, or is not in the current set of allowed nodes, etc.
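
A probe along these lines would catch those cases too, at the cost of a
round trip through the page allocator (rough, untested sketch):

/*
 * Untested sketch: ask the page allocator whether the node can satisfy
 * an allocation right now, so the answer reflects watermarks, dirty
 * limits, allowed nodes etc., not just whether the node has memory.
 */
static bool node_can_satisfy(int node, gfp_t flags)
{
	struct page *page;

	page = alloc_pages_node(node,
			(flags | GFP_THISNODE) & ~__GFP_WAIT, 0);
	if (!page)
		return false;

	__free_page(page);
	return true;
}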

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-07  5:48                                     ` Joonsoo Kim
@ 2014-02-07 17:53                                       ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-07 17:53 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, Nishanth Aravamudan, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> >
> > It seems like a better approach would be to do this when a node is brought
> > online and determine the fallback node based not on the zonelists as you
> > do here but rather on locality (such as through a SLIT if provided, see
> > node_distance()).
>
> Hmm...
> I guess that the zonelist is based on locality. The zonelist is generated using
> node_distance(), so I think that it reflects locality. But I'm not an expert
> on NUMA, so please let me know what I am missing here :)

The next node can be found by going through the zonelist of a node and
checking for available memory. See fallback_alloc().

There is a function node_distance() that determines the relative
performance of a memory access from one node to another.
The building of the fallback list for every node in build_zonelists()
relies on that.
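
In other words, the nearest node with memory could also be computed
directly from node_distance() (sketch only, untested):

/*
 * Sketch: find the closest node that actually has memory, using the
 * same distance metric that build_zonelists() relies on.
 */
static int nearest_mem_node(int node)
{
	int n, best_node = node, best_dist = INT_MAX;

	for_each_node_state(n, N_MEMORY) {
		int dist = node_distance(node, n);

		if (dist < best_dist) {
			best_dist = dist;
			best_node = n;
		}
	}
	return best_node;
}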

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-07 17:53                                       ` Christoph Lameter
@ 2014-02-07 18:51                                         ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-07 18:51 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, Nishanth Aravamudan, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

Here is a draft of a patch to make this work with memoryless nodes.

The first thing is that we modify node_match to also match if we hit an
empty node. In that case we simply take the current slab if it's there.

If there is no current slab then a regular allocation occurs with the
memoryless node. The page allocator will fall back to a possible node and
that will become the current slab. Next alloc from a memoryless node
will then use that slab.

For that we also add some tracking of allocations on nodes that were not
satisfied using the empty_node[] array. A successful alloc on a node
clears that flag.

I would rather avoid the empty_node[] array since it's global and there may
be thread-specific allocation restrictions, but it would be expensive to do
an allocation attempt via the page allocator to make sure that there is
really no page available from the page allocator.

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
+++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
@@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+static int empty_node[MAX_NUMNODES];
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 	int order;
+	int alloc_node;

 	BUG_ON(flags & GFP_SLAB_BUG_MASK);

 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+		if (node != NUMA_NO_NODE)
+			empty_node[node] = 1;
 		goto out;
+	}

 	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+	empty_node[alloc_node] = 0;
+	inc_slabs_node(s, alloc_node, page->objects);
 	memcg_bind_pages(s, order);
 	page->slab_cache = s;
 	__SetPageSlab(page);
@@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
 		struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
@@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+	int page_node;
+
+	/* No data means no match */
+	if (!page)
 		return 0;
+
+	/* Node does not matter. Therefore anything is a match */
+	if (node == NUMA_NO_NODE)
+		return 1;
+
+	/* Did we hit the requested node ? */
+	page_node = page_to_nid(page);
+	if (page_node == node)
+		return 1;
+
+	/* If the node has available data then we can use it. Mismatch */
+	return !empty_node[page_node];
+
+	/* Target node empty so just take anything */
 #endif
 	return 1;
 }

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-07 18:51                                         ` Christoph Lameter
@ 2014-02-07 21:38                                           ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-07 21:38 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.

Hi Christoph, should this be tested instead of Joonsoo's patches 2 (and 3)?

Thanks,
Nish

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-07  5:48                                     ` Joonsoo Kim
@ 2014-02-08  9:57                                       ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-02-08  9:57 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Nishanth Aravamudan, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li

On Fri, 7 Feb 2014, Joonsoo Kim wrote:

> > It seems like a better approach would be to do this when a node is brought 
> > online and determine the fallback node based not on the zonelists as you 
> > do here but rather on locality (such as through a SLIT if provided, see 
> > node_distance()).
> 
> Hmm...
> I guess that the zonelist is based on locality. The zonelist is generated using
> node_distance(), so I think that it reflects locality. But I'm not an expert
> on NUMA, so please let me know what I am missing here :)
> 

The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
If your solution is going to become the generic kernel API that determines 
what node has local memory for a particular node, then it will have to 
support all definitions of node.  That includes nodes that consist solely 
of I/O, chipsets, networking, or storage devices.  These nodes may not 
have memory or cpus, so doing it as part of onlining cpus isn't going to 
be generic enough.  You want a node_to_mem_node() API for all possible 
node types (the possible node types listed above are straight from the 
ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
X and we can optimize for that, but any solution that relies on cpu online 
is probably shortsighted right now.

I think it would be much better to do this as a part of setting a node to 
be online.
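
Roughly something like this (sketch only, and the names here are made
up), filled in at node-online time so the fast path is a table lookup:

/*
 * Sketch: cache the fallback node when a node is brought online so
 * node_to_mem_node() is a trivial lookup; for 99% of systems the
 * entry is the node itself.
 */
static int _node_to_mem_node[MAX_NUMNODES];

static inline int node_to_mem_node(int node)
{
	return _node_to_mem_node[node];
}

/* Call while onlining @node. */
static void init_node_to_mem_node(int node)
{
	int n, best = node, best_dist = INT_MAX;

	if (node_state(node, N_MEMORY)) {
		_node_to_mem_node[node] = node;
		return;
	}

	/* Nearest node with memory, by SLIT distance. */
	for_each_node_state(n, N_MEMORY) {
		int dist = node_distance(node, n);

		if (dist < best_dist) {
			best_dist = dist;
			best = n;
		}
	}
	_node_to_mem_node[node] = best;
}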

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-08  9:57                                       ` David Rientjes
@ 2014-02-10  1:09                                         ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-10  1:09 UTC (permalink / raw)
  To: David Rientjes
  Cc: Nishanth Aravamudan, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li

On Sat, Feb 08, 2014 at 01:57:39AM -0800, David Rientjes wrote:
> On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> 
> > > It seems like a better approach would be to do this when a node is brought 
> > > online and determine the fallback node based not on the zonelists as you 
> > > do here but rather on locality (such as through a SLIT if provided, see 
> > > node_distance()).
> > 
> > Hmm...
> > I guess that the zonelist is based on locality. The zonelist is generated using
> > node_distance(), so I think that it reflects locality. But I'm not an expert
> > on NUMA, so please let me know what I am missing here :)
> > 
> 
> The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
> If your solution is going to become the generic kernel API that determines 
> what node has local memory for a particular node, then it will have to 
> support all definitions of node.  That includes nodes that consist solely 
> of I/O, chipsets, networking, or storage devices.  These nodes may not 
> have memory or cpus, so doing it as part of onlining cpus isn't going to 
> be generic enough.  You want a node_to_mem_node() API for all possible 
> node types (the possible node types listed above are straight from the 
> ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
> X and we can optimize for that, but any solution that relies on cpu online 
> is probably shortsighted right now.
> 
> I think it would be much better to do this as a part of setting a node to 
> be online.

Okay. I got your point.
I will change it to rely on node onlining if this patch is really needed.
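
For example, something like this (rough sketch; node_set_numa_mem() is
a stand-in for whatever setter the next version would add, registered
with hotplug_memory_notifier()):

/*
 * Sketch: recompute each node's fallback when memory comes online,
 * via the memory hotplug notifier, instead of at cpu online time.
 */
static int numa_mem_callback(struct notifier_block *self,
			     unsigned long action, void *arg)
{
	int nid;

	if (action == MEM_ONLINE) {
		/* New memory may change the best fallback of any node. */
		for_each_online_node(nid)
			node_set_numa_mem(nid, local_memory_node(nid));
	}
	return NOTIFY_OK;
}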

Thanks!

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-07 21:38                                           ` Nishanth Aravamudan
@ 2014-02-10  1:15                                             ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-10  1:15 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Christoph Lameter, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Fri, Feb 07, 2014 at 01:38:55PM -0800, Nishanth Aravamudan wrote:
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> 
> Hi Christoph, this should be tested instead of Joonsoo's patch 2 (and 3)?

Hello,

I guess that your system has another problem that makes my patches ineffective.
Maybe it will also affect Christoph's one. Could you confirm page_to_nid(),
numa_mem_id() and node_present_pages(), although I am mostly suspicious of page_to_nid()?
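
A quick check could look like this (untested sketch, from any context
that can allocate):

/*
 * Untested sketch: allocate a page and see which node it really
 * reports, together with the present-pages count for that node.
 */
struct page *page = alloc_page(GFP_KERNEL);

if (page) {
	int nid = page_to_nid(page);

	pr_info("page node %d, present pages %lu, numa_mem_id %d\n",
		nid, node_present_pages(nid), numa_mem_id());
	__free_page(page);
}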

Thanks.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node
  2014-02-07 17:49                                   ` Christoph Lameter
@ 2014-02-10  1:22                                     ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-10  1:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nishanth Aravamudan, David Rientjes, Han Pingtian, penberg,
	linux-mm, paulus, Anton Blanchard, mpm, linuxppc-dev, Wanpeng Li

On Fri, Feb 07, 2014 at 11:49:57AM -0600, Christoph Lameter wrote:
> On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> 
> > > This check would need to be something that checks for other contingencies
> > > in the page allocator as well. A simple solution would be to actually run
> > > a GFP_THISNODE alloc to see if you can grab a page from the proper node.
> > > If that fails then fall back. See how fallback_alloc() does it in slab.
> > >
> >
> > Hello, Christoph.
> >
> > This !node_present_pages() ensures that allocation on this node cannot succeed.
> > So we can directly use numa_mem_id() here.
> 
> Yes of course we can use numa_mem_id().
> 
> But the check is only for not having any memory at all on a node. There
> are other reasons for allocations to fail on a certain node. The node could
> have memory that cannot be reclaimed, is all dirty, is beyond certain
> thresholds, or is not in the current set of allowed nodes, etc.

Yes. There are many other cases, but I prefer that we consider them separately.
Maybe they need another approach. For now, to solve the memoryless node problem,
my solution is sufficient and safe.

Thanks.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-07 18:51                                         ` Christoph Lameter
@ 2014-02-10  1:29                                           ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-10  1:29 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Nishanth Aravamudan, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if it's there.

Why not inspect whether we can get the page on the best node, such as the
numa_mem_id() node?

> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fall back to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since it's global and there may
> be thread-specific allocation restrictions, but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.
> 
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}

empty_node cannot be set for a memoryless node, since page allocation would
succeed on a different node.

Thanks.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-07 18:51                                         ` Christoph Lameter
@ 2014-02-10 19:13                                           ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-10 19:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

Hi Christoph,

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if it's there.
> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fall back to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since it's global and there may
> be thread-specific allocation restrictions, but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.

With this patch on our test system (I pulled out the numa_mem_id()
change, since you Acked Joonsoo's already), on top of 3.13.0 + my
kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
patch 1:

MemTotal:        8264704 kB
MemFree:         5924608 kB
...
Slab:            1402496 kB
SReclaimable:     102848 kB
SUnreclaim:      1299648 kB

And Anton's slabusage reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       207 MB   98.60%  100.00%
task_struct                         134 MB   97.82%  100.00%
kmalloc-8192                        117 MB  100.00%  100.00%
pgtable-2^12                        111 MB  100.00%  100.00%
pgtable-2^10                        104 MB  100.00%  100.00%

For comparison, Anton's patch applied at the same point in the series:

meminfo:

MemTotal:        8264704 kB
MemFree:         4150464 kB
...
Slab:            1590336 kB
SReclaimable:     208768 kB
SUnreclaim:      1381568 kB

slabusage:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       227 MB   98.63%  100.00%
kmalloc-8192                        130 MB  100.00%  100.00%
task_struct                         129 MB   97.73%  100.00%
pgtable-2^12                        112 MB  100.00%  100.00%
pgtable-2^10                        106 MB  100.00%  100.00%


Consider this patch:

Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

I was thinking about your concerns about empty_node[]. Would it make
sense to use helper functions, rather than direct access to
empty_node[], such as:

	bool is_node_empty(int nid)

	void set_node_empty(int nid, bool empty)

which we stub out if !HAVE_MEMORYLESS_NODES to return false and to be a
noop, respectively?

That way only architectures that have memoryless nodes pay the penalty
of the array allocation?
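
A minimal sketch of what I have in mind (untested, assuming the helper
names above):

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
static int empty_node[MAX_NUMNODES];

static inline bool is_node_empty(int nid)
{
	return empty_node[nid];
}

static inline void set_node_empty(int nid, bool empty)
{
	empty_node[nid] = empty;
}
#else
static inline bool is_node_empty(int nid) { return false; }
static inline void set_node_empty(int nid, bool empty) { }
#endif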

Thanks,
Nish

> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}
> 
>  	order = compound_order(page);
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> +	alloc_node = page_to_nid(page);
> +	empty_node[alloc_node] = 0;
> +	inc_slabs_node(s, alloc_node, page->objects);
>  	memcg_bind_pages(s, order);
>  	page->slab_cache = s;
>  	__SetPageSlab(page);
> @@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
> @@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> -	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> +	int page_node;
> +
> +	/* No data means no match */
> +	if (!page)
>  		return 0;
> +
> +	/* Node does not matter. Therefore anything is a match */
> +	if (node == NUMA_NO_NODE)
> +		return 1;
> +
> +	/* Did we hit the requested node ? */
> +	page_node = page_to_nid(page);
> +	if (page_node == node)
> +		return 1;
> +
> +	/* If the node has available data then we can use it. Mismatch */
> +	return !empty_node[page_node];
> +
> +	/* Target node empty so just take anything */
>  #endif
>  	return 1;
>  }
> 

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-10 19:13                                           ` Nishanth Aravamudan
  0 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-10 19:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li

Hi Christoph,

On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> Here is a draft of a patch to make this work with memoryless nodes.
> 
> The first thing is that we modify node_match to also match if we hit an
> empty node. In that case we simply take the current slab if it's there.
> 
> If there is no current slab then a regular allocation occurs with the
> memoryless node. The page allocator will fall back to a possible node and
> that will become the current slab. Next alloc from a memoryless node
> will then use that slab.
> 
> For that we also add some tracking of allocations on nodes that were not
> satisfied using the empty_node[] array. A successful alloc on a node
> clears that flag.
> 
> I would rather avoid the empty_node[] array since it's global and there may
> be thread-specific allocation restrictions, but it would be expensive to do
> an allocation attempt via the page allocator to make sure that there is
> really no page available from the page allocator.

With this patch on our test system (I pulled out the numa_mem_id()
change, since you Acked Joonsoo's already), on top of 3.13.0 + my
kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
patch 1:

MemTotal:        8264704 kB
MemFree:         5924608 kB
...
Slab:            1402496 kB
SReclaimable:     102848 kB
SUnreclaim:      1299648 kB

And Anton's slabusage reports:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       207 MB   98.60%  100.00%
task_struct                         134 MB   97.82%  100.00%
kmalloc-8192                        117 MB  100.00%  100.00%
pgtable-2^12                        111 MB  100.00%  100.00%
pgtable-2^10                        104 MB  100.00%  100.00%

For comparison, Anton's patch applied at the same point in the series:

meminfo:

MemTotal:        8264704 kB
MemFree:         4150464 kB
...
Slab:            1590336 kB
SReclaimable:     208768 kB
SUnreclaim:      1381568 kB

slabusage:

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       227 MB   98.63%  100.00%
kmalloc-8192                        130 MB  100.00%  100.00%
task_struct                         129 MB   97.73%  100.00%
pgtable-2^12                        112 MB  100.00%  100.00%
pgtable-2^10                        106 MB  100.00%  100.00%


Consider this patch:

Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

I was thinking about your concerns about empty_node[]. Would it make
sense to use helper functions, rather than direct access to
empty_node[], such as:

	bool is_node_empty(int nid)

	void set_node_empty(int nid, bool empty)

which we stub out if !HAVE_MEMORYLESS_NODES to return false and noop
respectively?

That way only architectures that have memoryless nodes pay the penalty
of the array allocation?
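
A minimal sketch of what I have in mind, assuming the empty_node[]
array from your draft (the helper names are just the proposal above,
nothing in-tree):

#ifdef CONFIG_HAVE_MEMORYLESS_NODES
static int empty_node[MAX_NUMNODES];

static inline bool is_node_empty(int nid)
{
	return empty_node[nid];
}

static inline void set_node_empty(int nid, bool empty)
{
	empty_node[nid] = empty;
}
#else
/* No memoryless nodes: no array, and the checks compile away */
static inline bool is_node_empty(int nid)
{
	return false;
}

static inline void set_node_empty(int nid, bool empty)
{
}
#endif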

Thanks,
Nish

> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-03 13:19:22.896853227 -0600
> +++ linux/mm/slub.c	2014-02-07 12:44:49.311494806 -0600
> @@ -132,6 +132,8 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +static int empty_node[MAX_NUMNODES];
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1407,22 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +		if (node != NUMA_NO_NODE)
> +			empty_node[node] = 1;
>  		goto out;
> +	}
> 
>  	order = compound_order(page);
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> +	alloc_node = page_to_nid(page);
> +	empty_node[alloc_node] = 0;
> +	inc_slabs_node(s, alloc_node, page->objects);
>  	memcg_bind_pages(s, order);
>  	page->slab_cache = s;
>  	__SetPageSlab(page);
> @@ -1712,7 +1720,7 @@ static void *get_partial(struct kmem_cac
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
> @@ -2107,8 +2115,25 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> -	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> +	int page_node;
> +
> +	/* No data means no match */
> +	if (!page)
>  		return 0;
> +
> +	/* Node does not matter. Therefore anything is a match */
> +	if (node == NUMA_NO_NODE)
> +		return 1;
> +
> +	/* Did we hit the requested node ? */
> +	page_node = page_to_nid(page);
> +	if (page_node == node)
> +		return 1;
> +
> +	/* If the node has available data then we can use it. Mismatch */
> +	return !empty_node[page_node];
> +
> +	/* Target node empty so just take anything */
>  #endif
>  	return 1;
>  }
> 

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-11  7:42                                             ` Joonsoo Kim
  0 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-11  7:42 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li

On Mon, Feb 10, 2014 at 11:13:21AM -0800, Nishanth Aravamudan wrote:
> Hi Christoph,
> 
> On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> > 
> > The first thing is that we modify node_match to also match if we hit an
> > empty node. In that case we simply take the current slab if it's there.
> > 
> > If there is no current slab then a regular allocation occurs with the
> > memoryless node. The page allocator will fall back to a possible node and
> > that will become the current slab. Next alloc from a memoryless node
> > will then use that slab.
> > 
> > For that we also add some tracking of allocations on nodes that were not
> > satisfied using the empty_node[] array. A successful alloc on a node
> > clears that flag.
> > 
> > I would rather avoid the empty_node[] array since it's global and there may
> > be thread-specific allocation restrictions, but it would be expensive to do
> > an allocation attempt via the page allocator to make sure that there is
> > really no page available from the page allocator.
> 
> With this patch on our test system (I pulled out the numa_mem_id()
> change, since you Acked Joonsoo's already), on top of 3.13.0 + my
> kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
> patch 1:
> 
> MemTotal:        8264704 kB
> MemFree:         5924608 kB
> ...
> Slab:            1402496 kB
> SReclaimable:     102848 kB
> SUnreclaim:      1299648 kB
> 
> And Anton's slabusage reports:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       207 MB   98.60%  100.00%
> task_struct                         134 MB   97.82%  100.00%
> kmalloc-8192                        117 MB  100.00%  100.00%
> pgtable-2^12                        111 MB  100.00%  100.00%
> pgtable-2^10                        104 MB  100.00%  100.00%
> 
> For comparison, Anton's patch applied at the same point in the series:
> 
> meminfo:
> 
> MemTotal:        8264704 kB
> MemFree:         4150464 kB
> ...
> Slab:            1590336 kB
> SReclaimable:     208768 kB
> SUnreclaim:      1381568 kB
> 
> slabusage:
> 
> slab                                   mem     objs    slabs
>                                       used   active   active
> ------------------------------------------------------------
> kmalloc-16384                       227 MB   98.63%  100.00%
> kmalloc-8192                        130 MB  100.00%  100.00%
> task_struct                         129 MB   97.73%  100.00%
> pgtable-2^12                        112 MB  100.00%  100.00%
> pgtable-2^10                        106 MB  100.00%  100.00%
> 
> 
> Consider this patch:
> 
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

Hello,

I still think that there is another problem.
Your report about CONFIG_SLAB said that SLAB uses just 200MB.
Below is your previous report.

  Ok, with your patches applied and CONFIG_SLAB enabled:

  MemTotal:        8264640 kB
  MemFree:         7119680 kB
  Slab:             207232 kB
  SReclaimable:      32896 kB
  SUnreclaim:       174336 kB

The numbers on CONFIG_SLUB with these patches tell us that SLUB uses 1.4GB.
There is a large difference in slab usage.

And, I should note that the number of active objects in slabinfo can be wrong
in some situations, since it doesn't consider the cpu slab (and cpu partial slabs).

I recommend confirming page_to_nid() and the other things I mentioned earlier.

Thanks.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-11 18:45                                             ` Christoph Lameter
  0 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-11 18:45 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li

On Mon, 10 Feb 2014, Joonsoo Kim wrote:

> On Fri, Feb 07, 2014 at 12:51:07PM -0600, Christoph Lameter wrote:
> > Here is a draft of a patch to make this work with memoryless nodes.
> >
> > The first thing is that we modify node_match to also match if we hit an
> > empty node. In that case we simply take the current slab if it's there.
>
> Why not inspect whether we can get the page on the best node, such as the
> numa_mem_id() node?

It's expensive to do so.

> empty_node cannot be set on a memoryless node, since the page allocation would
> succeed on a different node.

OK, then we need to add a check for being on the right node there too.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-12 22:16                                               ` Christoph Lameter
  0 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-12 22:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li

Here is another patch with some fixes. The additional logic is only
compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.

Subject: slub: Memoryless node support

Support memoryless nodes by tracking which allocations are failing.
Allocations targeted to nodes without memory fall back to the
currently available per cpu objects and, if those are not available,
will create a new slab using the page allocator to fall back from the
memoryless node to some other node.

Signed-off-by: Christoph Lameter <cl@linux.com>

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c	2014-02-12 16:07:48.957869570 -0600
+++ linux/mm/slub.c	2014-02-12 16:09:22.198928260 -0600
@@ -134,6 +134,10 @@ static inline bool kmem_cache_has_cpu_pa
 #endif
 }

+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+static nodemask_t empty_nodes;
+#endif
+
 /*
  * Issues still to be resolved:
  *
@@ -1405,16 +1409,28 @@ static struct page *new_slab(struct kmem
 	void *last;
 	void *p;
 	int order;
+	int alloc_node;

 	BUG_ON(flags & GFP_SLAB_BUG_MASK);

 	page = allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
-	if (!page)
+	if (!page) {
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+		if (node != NUMA_NO_NODE)
+			node_set(node, empty_nodes);
+#endif
 		goto out;
+	}

 	order = compound_order(page);
-	inc_slabs_node(s, page_to_nid(page), page->objects);
+	alloc_node = page_to_nid(page);
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+	node_clear(alloc_node, empty_nodes);
+	if (node != NUMA_NO_NODE && alloc_node != node)
+		node_set(node, empty_nodes);
+#endif
+	inc_slabs_node(s, alloc_node, page->objects);
 	memcg_bind_pages(s, order);
 	page->slab_cache = s;
 	__SetPageSlab(page);
@@ -1722,7 +1738,7 @@ static void *get_partial(struct kmem_cac
 		struct kmem_cache_cpu *c)
 {
 	void *object;
-	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

 	object = get_partial_node(s, get_node(s, searchnode), c, flags);
 	if (object || node != NUMA_NO_NODE)
@@ -2117,8 +2133,22 @@ static void flush_all(struct kmem_cache
 static inline int node_match(struct page *page, int node)
 {
 #ifdef CONFIG_NUMA
-	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
+	int page_node;
+
+	if (!page)
 		return 0;
+
+	/* Only dereference page after the NULL check above */
+	page_node = page_to_nid(page);
+
+	if (node != NUMA_NO_NODE) {
+#ifdef CONFIG_HAVE_MEMORYLESS_NODES
+		if (node_isset(node, empty_nodes))
+			return 1;
+#endif
+		if (page_node != node)
+			return 0;
+	}
 #endif
 	return 1;
 }

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-13  3:53                                                 ` Nishanth Aravamudan
  0 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-13  3:53 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Joonsoo Kim, linuxppc-dev, Wanpeng Li

On 12.02.2014 [16:16:11 -0600], Christoph Lameter wrote:
> Here is another patch with some fixes. The additional logic is only
> compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> 
> Subject: slub: Memoryless node support
> 
> Support memoryless nodes by tracking which allocations are failing.
> Allocations targeted to nodes without memory fall back to the
> currently available per cpu objects and, if those are not available,
> will create a new slab using the page allocator to fall back from the
> memoryless node to some other node.

I'll try and retest this once the LPAR in question comes free. Hopefully
in the next day or two.

Thanks,
Nish

> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-12 16:07:48.957869570 -0600
> +++ linux/mm/slub.c	2014-02-12 16:09:22.198928260 -0600
> @@ -134,6 +134,10 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +static nodemask_t empty_nodes;
> +#endif
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1409,28 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +		if (node != NUMA_NO_NODE)
> +			node_set(node, empty_nodes);
> +#endif
>  		goto out;
> +	}
> 
>  	order = compound_order(page);
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> +	alloc_node = page_to_nid(page);
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +	node_clear(alloc_node, empty_nodes);
> +	if (node != NUMA_NO_NODE && alloc_node != node)
> +		node_set(node, empty_nodes);
> +#endif
> +	inc_slabs_node(s, alloc_node, page->objects);
>  	memcg_bind_pages(s, order);
>  	page->slab_cache = s;
>  	__SetPageSlab(page);
> @@ -1722,7 +1738,7 @@ static void *get_partial(struct kmem_cac
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
> @@ -2117,8 +2133,22 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> -	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> +	int page_node;
> +
> +	if (!page)
>  		return 0;
> +
> +	/* Only dereference page after the NULL check above */
> +	page_node = page_to_nid(page);
> +
> +	if (node != NUMA_NO_NODE) {
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +		if (node_isset(node, empty_nodes))
> +			return 1;
> +#endif
> +		if (page_node != node)
> +			return 0;
> +	}
>  #endif
>  	return 1;
>  }
> 

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-13  6:51                                               ` Nishanth Aravamudan
  0 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-13  6:51 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li

Hi Joonsoo,

On 11.02.2014 [16:42:00 +0900], Joonsoo Kim wrote:
> On Mon, Feb 10, 2014 at 11:13:21AM -0800, Nishanth Aravamudan wrote:
> > Hi Christoph,
> > 
> > On 07.02.2014 [12:51:07 -0600], Christoph Lameter wrote:
> > > Here is a draft of a patch to make this work with memoryless nodes.
> > > 
> > > The first thing is that we modify node_match to also match if we hit an
> > > empty node. In that case we simply take the current slab if it's there.
> > > 
> > > If there is no current slab then a regular allocation occurs with the
> > > memoryless node. The page allocator will fall back to a possible node and
> > > that will become the current slab. Next alloc from a memoryless node
> > > will then use that slab.
> > > 
> > > For that we also add some tracking of allocations on nodes that were not
> > > satisfied using the empty_node[] array. A successful alloc on a node
> > > clears that flag.
> > > 
> > > I would rather avoid the empty_node[] array since it's global and there may
> > > be thread-specific allocation restrictions, but it would be expensive to do
> > > an allocation attempt via the page allocator to make sure that there is
> > > really no page available from the page allocator.
> > 
> > With this patch on our test system (I pulled out the numa_mem_id()
> > change, since you Acked Joonsoo's already), on top of 3.13.0 + my
> > kthread locality change + CONFIG_HAVE_MEMORYLESS_NODES + Joonsoo's RFC
> > patch 1:
> > 
> > MemTotal:        8264704 kB
> > MemFree:         5924608 kB
> > ...
> > Slab:            1402496 kB
> > SReclaimable:     102848 kB
> > SUnreclaim:      1299648 kB
> > 
> > And Anton's slabusage reports:
> > 
> > slab                                   mem     objs    slabs
> >                                       used   active   active
> > ------------------------------------------------------------
> > kmalloc-16384                       207 MB   98.60%  100.00%
> > task_struct                         134 MB   97.82%  100.00%
> > kmalloc-8192                        117 MB  100.00%  100.00%
> > pgtable-2^12                        111 MB  100.00%  100.00%
> > pgtable-2^10                        104 MB  100.00%  100.00%
> > 
> > For comparison, Anton's patch applied at the same point in the series:
> > 
> > meminfo:
> > 
> > MemTotal:        8264704 kB
> > MemFree:         4150464 kB
> > ...
> > Slab:            1590336 kB
> > SReclaimable:     208768 kB
> > SUnreclaim:      1381568 kB
> > 
> > slabusage:
> > 
> > slab                                   mem     objs    slabs
> >                                       used   active   active
> > ------------------------------------------------------------
> > kmalloc-16384                       227 MB   98.63%  100.00%
> > kmalloc-8192                        130 MB  100.00%  100.00%
> > task_struct                         129 MB   97.73%  100.00%
> > pgtable-2^12                        112 MB  100.00%  100.00%
> > pgtable-2^10                        106 MB  100.00%  100.00%
> > 
> > 
> > Consider this patch:
> > 
> > Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> > Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> 
> Hello,
> 
> I still think that there is another problem.
> Your report about CONFIG_SLAB said that SLAB uses just 200MB.
> Below is your previous report.
> 
>   Ok, with your patches applied and CONFIG_SLAB enabled:
> 
>   MemTotal:        8264640 kB
>   MemFree:         7119680 kB
>   Slab:             207232 kB
>   SReclaimable:      32896 kB
>   SUnreclaim:       174336 kB
> 
> The numbers on CONFIG_SLUB with these patches tell us that SLUB uses 1.4GB.
> There is a large difference in slab usage.

Agreed. But, at least for now, this gets us to not OOM all the time :) I
think that's significant progress. I will continue to look into where
the other gaps are, but would like to see Christoph's
latest patch get merged (pending my re-testing).

> And, I should note that the number of active objects in slabinfo can be
> wrong in some situations, since it doesn't consider the cpu slab (and cpu
> partial slabs).

Well, I grabbed everything from /sys/kernel/slab for you in the
tarballs, I believe.

> I recommend confirming page_to_nid() and the other things I mentioned
> earlier.

I believe these all worked once CONFIG_HAVE_MEMORYLESS_NODES was set for
ppc64, but will test it again when I have access to the test system.

Also, given that only ia64 and (hopefully soon) ppc64 can set
CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
memoryless nodes present? Even with fakenuma? Just curious.

-Nish

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-17  6:52                                                 ` Joonsoo Kim
  0 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-17  6:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li

On Wed, Feb 12, 2014 at 04:16:11PM -0600, Christoph Lameter wrote:
> Here is another patch with some fixes. The additional logic is only
> compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> 
> Subject: slub: Memoryless node support
> 
> Support memoryless nodes by tracking which allocations are failing.

I still don't understand why this tracking is needed.
All we need for an allocation targeted to a memoryless node is to fall back to
a proper node, that is, the numa_mem_id() node of the targeted node. My previous
patch implements that and uses a proper fallback node on every allocation code
path. Why is this tracking needed? Please elaborate more on this.

> Allocations targeted to nodes without memory fall back to the
> currently available per cpu objects and, if those are not available,
> will create a new slab using the page allocator to fall back from the
> memoryless node to some other node.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>
> 
> @@ -1722,7 +1738,7 @@ static void *get_partial(struct kmem_cac
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)

This isn't enough.
Consider an allocation targeted to a memoryless node:
get_partial_node() always fails even if there are some partial slabs on
the memoryless node's nearest node.
We should fall back to some proper node in this case, since there are no
slabs on the memoryless node.
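
A sketch of the fallback I mean, on top of your get_partial() hunk
(node_to_mem_node() is a hypothetical helper that would map a memoryless
node to its nearest node with memory; it does not exist yet):

	void *object;
	int searchnode = node;

	if (node == NUMA_NO_NODE)
		searchnode = numa_mem_id();
	else if (!node_present_pages(node))
		searchnode = node_to_mem_node(node);	/* hypothetical */

	object = get_partial_node(s, get_node(s, searchnode), c, flags);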

Thanks.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-17  7:00                                                 ` Joonsoo Kim
  0 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-17  7:00 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, Christoph Lameter, linuxppc-dev, Wanpeng Li

On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> Hi Joonsoo,
> Also, given that only ia64 and (hopefully soon) ppc64 can set
> CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> memoryless nodes present? Even with fakenuma? Just curious.

I don't know, because I'm not an expert on NUMA systems :)
At first glance, fakenuma can't be used for testing
CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.
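
For reference, fake NUMA on x86 is enabled with the documented boot
parameter below, and every fake node it creates gets its own share of
memory, so it cannot produce a memoryless node as-is:

	numa=fake=8	(split RAM into 8 equally sized fake nodes)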

Thanks.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-18 16:38                                                   ` Christoph Lameter
  0 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-18 16:38 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li

On Mon, 17 Feb 2014, Joonsoo Kim wrote:

> On Wed, Feb 12, 2014 at 04:16:11PM -0600, Christoph Lameter wrote:
> > Here is another patch with some fixes. The additional logic is only
> > compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> >
> > Subject: slub: Memoryless node support
> >
> > Support memoryless nodes by tracking which allocations are failing.
>
> I still don't understand why this tracking is needed.

It's an optimization to avoid calling the page allocator to figure out if
there is memory available on a particular node.
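
For contrast, probing the page allocator on every mismatch would look
something like the sketch below (hypothetical, not part of the patch;
GFP details glossed over). The empty_nodes nodemask caches this answer
instead, at the cost of being global:

/* Ask the page allocator directly -- correct, but a slow path per alloc */
static bool node_really_empty(int node)
{
	struct page *page = alloc_pages_node(node,
			GFP_NOWAIT | __GFP_THISNODE | __GFP_NOWARN, 0);

	if (page) {
		__free_pages(page, 0);
		return false;
	}
	return true;
}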

> All we need for an allocation targeted to a memoryless node is to fall back to
> a proper node, that is, the numa_mem_id() node of the targeted node. My previous
> patch implements that and uses a proper fallback node on every allocation code
> path. Why is this tracking needed? Please elaborate more on this.

It's too slow to do that on every alloc. One needs to be able to satisfy
most allocations without switching percpu slabs for optimal performance.

> > Allocations targeted to the nodes without memory fall back to the
> > current available per cpu objects and if that is not available will
> > create a new slab using the page allocator to fallback from the
> > memoryless node to some other node.

And what about the next alloc? Assume there are N allocs from a memoryless
node this means we push back the partial slab on each alloc and then fall
back?

> >  {
> >  	void *object;
> > -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> >
> >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> >  	if (object || node != NUMA_NO_NODE)
>
> This isn't enough.
> Consider an allocation targeted to a memoryless node:

It will not commonly get there because of the tracking. Instead a per cpu
object will be used.

> get_partial_node() always fails even if there are some partial slabs on
> the memoryless node's nearest node.

Correct, and that leads to a page allocator action whereupon the node will
be marked as empty.

> We should fall back to some proper node in this case, since there are no
> slabs on the memoryless node.

NUMA is about optimization of memory allocations. It is often *not* about
correctness; heuristics are used in many cases. E.g. see the zone
reclaim logic, zone reclaim mode, the fallback scenarios in the page
allocator, etc.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-02-18 16:57                                                   ` Christoph Lameter
  0 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-18 16:57 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Han Pingtian, Nishanth Aravamudan, Matt Mackall, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	David Rientjes, linuxppc-dev, Wanpeng Li

On Mon, 17 Feb 2014, Joonsoo Kim wrote:

> On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> > Hi Joonsoo,
> > Also, given that only ia64 and (hopefully soon) ppc64 can set
> > CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> > memoryless nodes present? Even with fakenuma? Just curious.

x86_64 currently does not support memoryless nodes, otherwise it would
have set CONFIG_HAVE_MEMORYLESS_NODES in the kconfig. Memoryless nodes are
a bit strange given that the NUMA paradigm is to have NUMA nodes (meaning
memory) with processors. MEMORYLESS nodes mean that we have a fake NUMA
node without memory, just processors. Not very efficient. Not sure why
people use these configurations.

> I don't know, because I'm not expert on NUMA system :)
> At first glance, fakenuma can't be used for testing
> CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.

Well yeah. You'd have to do some mods to enable that testing.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-12 22:16                                               ` Christoph Lameter
@ 2014-02-18 17:22                                                 ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-18 17:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On 12.02.2014 [16:16:11 -0600], Christoph Lameter wrote:
> Here is another patch with some fixes. The additional logic is only
> compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> 
> Subject: slub: Memoryless node support
> 
> Support memoryless nodes by tracking which allocations are failing.
> Allocations targeted to nodes without memory fall back to the
> currently available per cpu objects and, if those are not available,
> create a new slab using the page allocator, falling back from the
> memoryless node to some other node.
> 
> Signed-off-by: Christoph Lameter <cl@linux.com>

Tested-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

> Index: linux/mm/slub.c
> ===================================================================
> --- linux.orig/mm/slub.c	2014-02-12 16:07:48.957869570 -0600
> +++ linux/mm/slub.c	2014-02-12 16:09:22.198928260 -0600
> @@ -134,6 +134,10 @@ static inline bool kmem_cache_has_cpu_pa
>  #endif
>  }
> 
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +static nodemask_t empty_nodes;
> +#endif
> +
>  /*
>   * Issues still to be resolved:
>   *
> @@ -1405,16 +1409,28 @@ static struct page *new_slab(struct kmem
>  	void *last;
>  	void *p;
>  	int order;
> +	int alloc_node;
> 
>  	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> 
>  	page = allocate_slab(s,
>  		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> -	if (!page)
> +	if (!page) {
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +		if (node != NUMA_NO_NODE)
> +			node_set(node, empty_nodes);
> +#endif
>  		goto out;
> +	}
> 
>  	order = compound_order(page);
> -	inc_slabs_node(s, page_to_nid(page), page->objects);
> +	alloc_node = page_to_nid(page);
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +	node_clear(alloc_node, empty_nodes);
> +	if (node != NUMA_NO_NODE && alloc_node != node)
> +		node_set(node, empty_nodes);
> +#endif
> +	inc_slabs_node(s, alloc_node, page->objects);
>  	memcg_bind_pages(s, order);
>  	page->slab_cache = s;
>  	__SetPageSlab(page);
> @@ -1722,7 +1738,7 @@ static void *get_partial(struct kmem_cac
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> 
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)
> @@ -2117,8 +2133,20 @@ static void flush_all(struct kmem_cache
>  static inline int node_match(struct page *page, int node)
>  {
>  #ifdef CONFIG_NUMA
> -	if (!page || (node != NUMA_NO_NODE && page_to_nid(page) != node))
> +	int page_node;
> +
> +	if (!page)
>  		return 0;
> +
> +	page_node = page_to_nid(page);
> +	if (node != NUMA_NO_NODE) {
> +#ifdef CONFIG_HAVE_MEMORYLESS_NODES
> +		if (node_isset(node, empty_nodes))
> +			return 1;
> +#endif
> +		if (page_node != node)
> +			return 0;
> +	}
>  #endif
>  	return 1;
>  }
> 
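
To make the tracking concrete, here is a minimal annotated sketch (my
illustration, not part of the patch) of the intended slow-path behavior
once a node has been marked empty:

/*
 * Sketch: assume node 2 is memoryless and a previous
 * new_slab(s, flags, 2) failed, so the patch did
 * node_set(2, empty_nodes).
 */
void *sketch(struct kmem_cache *s)
{
	/*
	 * node_match() sees node_isset(2, empty_nodes) and returns 1,
	 * so the current per cpu slab satisfies the allocation instead
	 * of being deactivated and retried.
	 */
	void *object = kmem_cache_alloc_node(s, GFP_KERNEL, 2);

	/*
	 * Only when the per cpu slab is exhausted does new_slab() hit
	 * the page allocator with node 2; the page then comes from a
	 * fallback node, and new_slab()'s node_clear()/node_set() keep
	 * empty_nodes in sync with where pages actually came from.
	 */
	return object;
}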

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-18 16:57                                                   ` Christoph Lameter
@ 2014-02-18 17:28                                                     ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-18 17:28 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On 18.02.2014 [10:57:09 -0600], Christoph Lameter wrote:
> On Mon, 17 Feb 2014, Joonsoo Kim wrote:
> 
> > On Wed, Feb 12, 2014 at 10:51:37PM -0800, Nishanth Aravamudan wrote:
> > > Hi Joonsoo,
> > > Also, given that only ia64 and (hopefully soon) ppc64 can set
> > > CONFIG_HAVE_MEMORYLESS_NODES, does that mean x86_64 can't have
> > > memoryless nodes present? Even with fakenuma? Just curious.
> 
> x86_64 currently does not support memoryless nodes, otherwise it would
> have set CONFIG_HAVE_MEMORYLESS_NODES in the kconfig. Memoryless nodes are
> a bit strange given that the NUMA paradigm is to have NUMA nodes (meaning
> memory) with processors. A memoryless node means that we have a fake NUMA
> node with processors but no memory. Not very efficient. Not sure why
> people use these configurations.

Well, on powerpc, with the hypervisor providing the resources and the
topology, you can have cpuless and memoryless nodes. I'm not sure how
"fake" the NUMA is -- as I think since the resources are virtualized to
be one system, it's logically possible that the actual topology of the
resources can be CPUs from physical node 0 and memory from physical node
2. I would think with KVM on a sufficiently large (physically NUMA
x86_64) and loaded system, one could cause the same sort of
configuration to occur for a guest?

In any case, these configurations happen fairly often on long-running
(not rebooted) systems as LPARs are created/destroyed, resources are
DLPAR'd in and out of LPARs, etc.

> > I don't know, because I'm not an expert on NUMA systems :)
> > At first glance, fakenuma can't be used for testing
> > CONFIG_HAVE_MEMORYLESS_NODES. Maybe some modification is needed.
> 
> Well yeah. You'd have to do some mods to enable that testing.

I might look into it, as it might have sped up testing these changes.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-18 17:28                                                     ` Nishanth Aravamudan
@ 2014-02-18 19:58                                                       ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-18 19:58 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

>
> Well, on powerpc, with the hypervisor providing the resources and the
> topology, you can have cpuless and memoryless nodes. I'm not sure how
> "fake" the NUMA is -- as I think since the resources are virtualized to
> be one system, it's logically possible that the actual topology of the
> resources can be CPUs from physical node 0 and memory from physical node
> 2. I would think with KVM on a sufficiently large (physically NUMA
> x86_64) and loaded system, one could cause the same sort of
> configuration to occur for a guest?

Ok, but since you have a virtualized environment: why not provide a fake
home node with fake memory that could be anywhere? This would avoid the
whole problem of supporting such a config at the kernel level.

Do not have a fake node that has no memory.

> In any case, these configurations happen fairly often on long-running
> (not rebooted) systems as LPARs are created/destroyed, resources are
> DLPAR'd in and out of LPARs, etc.

Ok then also move the memory of the local node somewhere?

> I might look into it, as it might have sped up testing these changes.

I guess that will be necessary in order to support the memoryless nodes
long term.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-18 19:58                                                       ` Christoph Lameter
@ 2014-02-18 21:09                                                         ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-18 21:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On 18.02.2014 [13:58:20 -0600], Christoph Lameter wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> 
> >
> > Well, on powerpc, with the hypervisor providing the resources and the
> > topology, you can have cpuless and memoryless nodes. I'm not sure how
> > "fake" the NUMA is -- as I think since the resources are virtualized to
> > be one system, it's logically possible that the actual topology of the
> > resources can be CPUs from physical node 0 and memory from physical node
> > 2. I would think with KVM on a sufficiently large (physically NUMA
> > x86_64) and loaded system, one could cause the same sort of
> > configuration to occur for a guest?
> 
> Ok but since you have a virtualized environment: Why not provide a fake
> home node with fake memory that could be anywhere? This would avoid the
> whole problem of supporting such a config at the kernel level.

We use the topology provided by the hypervisor; it does actually reflect
where CPUs and memory are, and their corresponding performance/NUMA
characteristics.

> Do not have a fake node that has no memory.
> 
> > In any case, these configurations happen fairly often on long-running
> > (not rebooted) systems as LPARs are created/destroyed, resources are
> > DLPAR'd in and out of LPARs, etc.
> 
> Ok then also move the memory of the local node somewhere?

This happens below the OS; we don't control the hypervisor's decisions.
I'm not sure if that's what you are suggesting.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-18 21:09                                                         ` Nishanth Aravamudan
@ 2014-02-18 21:49                                                           ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-18 21:49 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> We use the topology provided by the hypervisor; it does actually reflect
> where CPUs and memory are, and their corresponding performance/NUMA
> characteristics.

And so there are actually nodes without memory that have processors?
Can the hypervisor or the Linux arch code be convinced to ignore nodes
without memory or assign a sane default node to processors?

> > Ok then also move the memory of the local node somewhere?
>
> This happens below the OS, we don't control the hypervisor's decisions.
> I'm not sure if that's what you are suggesting.

You could also do this from the powerpc arch code by sanitizing the
processor / node information that is then used by Linux.
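
A hedged sketch of what such arch-side sanitization could look like
(assumed generic helpers from the topology code, not an actual powerpc
patch):

/* At boot, point each CPU on a memoryless node at its nearest node
 * with memory, so numa_mem_id() returns a sane default later on. */
static void __init sanitize_cpu_node(int cpu)
{
	int nid = cpu_to_node(cpu);

	if (!node_state(nid, N_MEMORY))
		nid = local_memory_node(nid);	/* nearest node with memory */

	set_cpu_numa_mem(cpu, nid);
}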

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-18 21:49                                                           ` Christoph Lameter
@ 2014-02-18 22:22                                                             ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-02-18 22:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On 18.02.2014 [15:49:22 -0600], Christoph Lameter wrote:
> On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:
> 
> > We use the topology provided by the hypervisor; it does actually reflect
> > where CPUs and memory are, and their corresponding performance/NUMA
> > characteristics.
> 
> And so there are actually nodes without memory that have processors?

Virtually (topologically, as indicated to Linux), yes. Physically, I
don't think they are, but they might be exhausted, which is why we get
sort of odd-appearing NUMA configurations.

> Can the hypervisor or the linux arch code be convinced to ignore nodes
> without memory or assign a sane default node to processors?

I think this happens quite often, so I don't know that we want to ignore
the performance impact of the underlying NUMA configuration. I guess we
could special-case memoryless/cpuless configurations somewhat, but I
don't think there's any reason to do that if we can make memoryless-node
support work in-kernel?

> > > Ok then also move the memory of the local node somewhere?
> >
> > This happens below the OS, we don't control the hypervisor's decisions.
> > I'm not sure if that's what you are suggesting.
> 
> You could also do this from the powerpc arch code by sanitizing the
> processor / node information that is then used by Linux.

I see what you're saying now, thanks!

-Nish

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-18 22:22                                                             ` Nishanth Aravamudan
@ 2014-02-19 16:11                                                               ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-19 16:11 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Tue, 18 Feb 2014, Nishanth Aravamudan wrote:

> the performance impact of the underlying NUMA configuration. I guess we
> could special-case memoryless/cpuless configurations somewhat, but I
> don't think there's any reason to do that if we can make memoryless-node
> support work in-kernel?

Well, we can make it work in-kernel, but it has always been a bit wacky (as
is the idea of NUMA "memory" nodes without memory).

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-18 19:58                                                       ` Christoph Lameter
@ 2014-02-19 22:03                                                         ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-02-19 22:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nishanth Aravamudan, Joonsoo Kim, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Tue, 18 Feb 2014, Christoph Lameter wrote:

> Ok but since you have a virtualized environment: Why not provide a fake
> home node with fake memory that could be anywhere? This would avoid the
> whole problem of supporting such a config at the kernel level.
> 

By ACPI, the abstraction of a NUMA node can include any combination of
CPUs, memory, I/O resources, networking, or storage devices. This allows
two memoryless nodes, for example, to have different proximity to memory.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-18 16:38                                                   ` Christoph Lameter
@ 2014-02-19 22:04                                                     ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-02-19 22:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, Nishanth Aravamudan, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Tue, 18 Feb 2014, Christoph Lameter wrote:

> It's an optimization to avoid calling the page allocator to figure out if
> there is memory available on a particular node.
> 

Thus this patch breaks with memory hot-add for a memoryless node.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-19 22:04                                                     ` David Rientjes
@ 2014-02-20 16:02                                                       ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-20 16:02 UTC (permalink / raw)
  To: David Rientjes
  Cc: Joonsoo Kim, Nishanth Aravamudan, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Wed, 19 Feb 2014, David Rientjes wrote:

> On Tue, 18 Feb 2014, Christoph Lameter wrote:
>
> > It's an optimization to avoid calling the page allocator to figure out if
> > there is memory available on a particular node.
> Thus this patch breaks with memory hot-add for a memoryless node.

As soon as the per cpu slab is exhausted, the node number of the so-far
"empty" node will be used for allocation. That will be successful and the
node will no longer be marked as empty.
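
In terms of the tracking patch earlier in this thread, that recovery is
the node_clear() in new_slab() (sketch of the relevant lines, my
annotation):

	/* After hot-add, the first slow-path allocation aimed at the
	 * formerly "empty" node gets local pages again: */
	alloc_node = page_to_nid(page);		/* now the hot-added node */
	node_clear(alloc_node, empty_nodes);	/* un-mark it */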

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-18 16:38                                                   ` Christoph Lameter
@ 2014-02-24  5:08                                                     ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-02-24  5:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Nishanth Aravamudan, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Tue, Feb 18, 2014 at 10:38:01AM -0600, Christoph Lameter wrote:
> On Mon, 17 Feb 2014, Joonsoo Kim wrote:
> 
> > On Wed, Feb 12, 2014 at 04:16:11PM -0600, Christoph Lameter wrote:
> > > Here is another patch with some fixes. The additional logic is only
> > > compiled in if CONFIG_HAVE_MEMORYLESS_NODES is set.
> > >
> > > Subject: slub: Memoryless node support
> > >
> > > Support memoryless nodes by tracking which allocations are failing.
> >
> > I still don't understand why this tracking is needed.
> 
> It's an optimization to avoid calling the page allocator to figure out if
> there is memory available on a particular node.
> 
> > All we need for an allocation targeted to a memoryless node is to fall
> > back to a proper node, that is, the numa_mem_id() node of the targeted
> > node. My previous patch implements this and uses the proper fallback
> > node on every allocation code path. Why is this tracking needed?
> > Please elaborate more on this.
> 
> It's too slow to do that on every alloc. One needs to be able to satisfy
> most allocations without switching percpu slabs for optimal performance.

I don't think that we need to switch percpu slabs on every alloc.
Allocations targeted to a specific node are rare, and most of them are
targeted to either numa_node_id() or numa_mem_id(). My patch considers
these cases, so most allocations are processed by percpu slabs. There is
no suboptimal performance.
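
For concreteness, a minimal sketch of that approach (my illustration,
assuming a helper like local_memory_node() to map a memoryless node to
its nearest node with memory; this is not the exact patch):

static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
		struct kmem_cache_cpu *c)
{
	void *object;
	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;

	/* Translate a memoryless target to its fallback node instead of
	 * letting get_partial_node() fail and hitting the page allocator. */
	if (!node_state(searchnode, N_MEMORY))
		searchnode = local_memory_node(searchnode);

	object = get_partial_node(s, get_node(s, searchnode), c, flags);
	if (object || node != NUMA_NO_NODE)
		return object;

	return get_any_partial(s, flags, c);
}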

> 
> > > Allocations targeted to nodes without memory fall back to the
> > > currently available per cpu objects and, if those are not available,
> > > create a new slab using the page allocator, falling back from the
> > > memoryless node to some other node.
> 
And what about the next alloc? Assume there are N allocs from a memoryless
node; this means we push back the partial slab on each alloc and then fall
back?
> 
> > >  {
> > >  	void *object;
> > > -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> > > +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
> > >
> > >  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
> > >  	if (object || node != NUMA_NO_NODE)
> >
> > This isn't enough.
> > Consider an allocation targeted to a memoryless node.
> 
> It will not commonly get there because of the tracking. Instead a per cpu
> object will be used.
> > get_partial_node() always fails even if there are some partial slabs on
> > the memoryless node's nearest node.
> 
> Correct, and that leads to a page allocator action, whereupon the node will
> be marked as empty.

Why do we need to go to the page allocator if there is a partial slab?
Checking whether a node is memoryless or not is really easy, so we don't
need to skip this. Skipping it is a suboptimal solution.

> > We should fall back to some proper node in this case, since there are no
> > slabs on the memoryless node.
> 
> NUMA is about optimization of memory allocations. It is often *not* about
> correctness; heuristics are used in many cases. F.e. see the zone
> reclaim logic, zone reclaim mode, the fallback scenarios in the page
> allocator, etc.

Okay. But 'do our best' is preferable to me.

Thanks.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-24  5:08                                                     ` Joonsoo Kim
@ 2014-02-24 19:54                                                       ` Christoph Lameter
  -1 siblings, 0 replies; 229+ messages in thread
From: Christoph Lameter @ 2014-02-24 19:54 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Nishanth Aravamudan, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On Mon, 24 Feb 2014, Joonsoo Kim wrote:

> > It will not commonly get there because of the tracking. Instead a per cpu
> > object will be used.
> > > get_partial_node() always fails even if there are some partial slabs on
> > > the memoryless node's nearest node.
> >
> > Correct, and that leads to a page allocator action, whereupon the node will
> > be marked as empty.
>
> Why do we need to go to the page allocator if there is a partial slab?
> Checking whether a node is memoryless or not is really easy, so we don't
> need to skip this. Skipping it is a suboptimal solution.

The page allocator action is also used to determine which other node we
should fall back to if the node is empty. So we need to call the page
allocator, when the per cpu slab is exhausted, with the node of the
memoryless node, to get memory from the proper fallback node.
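
A sketch of why that round-trip answers the fallback question (my
illustration, not from the patch): the nid passed in only selects a
zonelist, and a memoryless node's zonelist begins at its nearest node
with memory.

static int sketch_discover_fallback(int memoryless_nid)
{
	struct page *page = alloc_pages_node(memoryless_nid, GFP_KERNEL, 0);
	int nid;

	if (!page)
		return NUMA_NO_NODE;	/* mark the node empty, as the patch does */

	nid = page_to_nid(page);	/* the proper fallback node */
	__free_page(page);
	return nid;
}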

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-24 19:54                                                       ` Christoph Lameter
@ 2014-03-13 16:51                                                         ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-03-13 16:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, linuxppc-dev, Wanpeng Li

On 24.02.2014 [13:54:35 -0600], Christoph Lameter wrote:
> On Mon, 24 Feb 2014, Joonsoo Kim wrote:
> 
> > > It will not commonly get there because of the tracking. Instead a per cpu
> > > object will be used.
> > > > get_partial_node() always fails even if there are some partial slabs on
> > > > the memoryless node's nearest node.
> > >
> > > Correct, and that leads to a page allocator action, whereupon the node will
> > > be marked as empty.
> >
> > Why do we need to go to the page allocator if there is a partial slab?
> > Checking whether a node is memoryless or not is really easy, so we don't
> > need to skip this. Skipping it is a suboptimal solution.
> 
> The page allocator action is also used to determine which other node we
> should fall back to if the node is empty. So we need to call the page
> allocator, when the per cpu slab is exhausted, with the node of the
> memoryless node, to get memory from the proper fallback node.

Where do we stand with these patches? I feel like no resolution was
really found...

Thanks,
Nish

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
  2014-02-06  8:07                           ` Joonsoo Kim
@ 2014-05-16 23:37                             ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-05-16 23:37 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li

On 06.02.2014 [17:07:04 +0900], Joonsoo Kim wrote:
> Currently, if the allocation constraint is NUMA_NO_NODE, we search for
> a partial slab on the numa_node_id() node. This doesn't work properly on a
> system having a memoryless node, since there can be no memory on that node
> and hence no partial slabs on that node.
> 
> On that node, page allocation always falls back to numa_mem_id() first. So
> searching for a partial slab on numa_mem_id() in that case is the proper
> solution for the memoryless node case.
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>

Joonsoo, would you send this one on to Andrew?

Thanks,
Nish

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id()
  2014-05-16 23:37                             ` Nishanth Aravamudan
@ 2014-05-19  2:41                               ` Joonsoo Kim
  -1 siblings, 0 replies; 229+ messages in thread
From: Joonsoo Kim @ 2014-05-19  2:41 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Han Pingtian, penberg, linux-mm, paulus,
	Anton Blanchard, mpm, Christoph Lameter, linuxppc-dev,
	Wanpeng Li

On Fri, May 16, 2014 at 04:37:35PM -0700, Nishanth Aravamudan wrote:
> On 06.02.2014 [17:07:04 +0900], Joonsoo Kim wrote:
> > Currently, if the allocation constraint is NUMA_NO_NODE, we search for
> > a partial slab on the numa_node_id() node. This doesn't work properly on a
> > system having a memoryless node, since there can be no memory on that node
> > and hence no partial slabs on that node.
> > 
> > On that node, page allocation always falls back to numa_mem_id() first. So
> > searching for a partial slab on numa_mem_id() in that case is the proper
> > solution for the memoryless node case.
> > 
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> 
> Joonsoo, would you send this one on to Andrew?

Hello,

Okay. I will do it.

Thanks.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RESEND PATCH] slub: search partial list on numa_mem_id(), instead of numa_node_id()
  2014-02-06  8:07                           ` Joonsoo Kim
@ 2014-06-05  0:13                             ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-06-05  0:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joonsoo Kim, Nishanth Aravamudan, Pekka Enberg,
	Christoph Lameter, linux-mm, linux-kernel, Joonsoo Kim,
	Han Pingtian, paulus, Anton Blanchard, mpm, linuxppc-dev,
	Wanpeng Li

On Wed, 21 May 2014, Joonsoo Kim wrote:

> Currently, if the allocation constraint is NUMA_NO_NODE, we search for
> a partial slab on the numa_node_id() node. This doesn't work properly on a
> system with a memoryless node, since that node has no memory and can
> therefore never hold a partial slab.
> 
> On such a node, page allocation always falls back to numa_mem_id() first. So
> searching for a partial slab on numa_mem_id() instead is the proper solution
> for the memoryless-node case.
> 
> Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Christoph Lameter <cl@linux.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 545a170..cc1f995 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1698,7 +1698,7 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
>  		struct kmem_cache_cpu *c)
>  {
>  	void *object;
> -	int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
> +	int searchnode = (node == NUMA_NO_NODE) ? numa_mem_id() : node;
>  
>  	object = get_partial_node(s, get_node(s, searchnode), c, flags);
>  	if (object || node != NUMA_NO_NODE)

Andrew, can you merge this please?  It's still not in linux-next.

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-02-10  1:09                                         ` Joonsoo Kim
@ 2014-07-22  1:03                                           ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-07-22  1:03 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: David Rientjes, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li

On 10.02.2014 [10:09:36 +0900], Joonsoo Kim wrote:
> On Sat, Feb 08, 2014 at 01:57:39AM -0800, David Rientjes wrote:
> > On Fri, 7 Feb 2014, Joonsoo Kim wrote:
> > 
> > > > It seems like a better approach would be to do this when a node is brought 
> > > > online and determine the fallback node based not on the zonelists as you 
> > > > do here but rather on locality (such as through a SLIT if provided, see 
> > > > node_distance()).
> > > 
> > > Hmm...
> > > I guess that the zonelist is based on locality. The zonelist is generated using
> > > node_distance(), so I think that it reflects locality. But I'm not an expert
> > > on NUMA, so please let me know what I am missing here :)
> > > 
> > 
> > The zonelist is, yes, but I'm talking about memoryless and cpuless nodes.  
> > If your solution is going to become the generic kernel API that determines 
> > what node has local memory for a particular node, then it will have to 
> > support all definitions of node.  That includes nodes that consist solely 
> > of I/O, chipsets, networking, or storage devices.  These nodes may not 
> > have memory or cpus, so doing it as part of onlining cpus isn't going to 
> > be generic enough.  You want a node_to_mem_node() API for all possible 
> > node types (the possible node types listed above are straight from the 
> > ACPI spec).  For 99% of people, node_to_mem_node(X) is always going to be 
> > X and we can optimize for that, but any solution that relies on cpu online 
> > is probably shortsighted right now.
> > 
> > I think it would be much better to do this as a part of setting a node to 
> > be online.
> 
> Okay. I got your point.
> I will change it to rely on node online if this patch is really needed.

Sorry for bringing up this old thread again, but I had a question for
you, David. node_to_mem_node(), which does seem like a useful API,
doesn't seem like it can rely on node_distance() alone, right? Because
that just tells us the relative cost (or so I think of it) of using
resources from that node. But we also need to know if that node itself
has memory, etc. So using the zonelists is required no matter what? And
upon memory hotplug (or unplug), the topology can change in a way that
affects things, so node online time isn't right either?

Thanks,
Nish

^ permalink raw reply	[flat|nested] 229+ messages in thread
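
A sketch of the shape node_to_mem_node() could take, assuming a per-node
cache of local_memory_node() that is refreshed whenever the zonelists are
(re)built (at boot and on memory hotplug), which is what the hotplug
concern above calls for. Names and placement are illustrative, not
verbatim mainline code:

int _node_numa_mem_[MAX_NUMNODES];

/* nearest node with memory; the identity for nodes that have memory */
static inline int node_to_mem_node(int node)
{
	return _node_numa_mem_[node];
}

/* refresh the cache wherever the zonelists are (re)built */
static void update_node_numa_mem(void)
{
	int node;

	for_each_online_node(node)
		_node_numa_mem_[node] = local_memory_node(node);
}

/* get_partial() could then handle an explicit but memoryless node too */
static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
		struct kmem_cache_cpu *c)
{
	void *object;
	int searchnode = node;

	if (node == NUMA_NO_NODE)
		searchnode = numa_mem_id();
	else if (!node_present_pages(node))
		searchnode = node_to_mem_node(node);

	object = get_partial_node(s, get_node(s, searchnode), c, flags);
	if (object || node != NUMA_NO_NODE)
		return object;

	return get_any_partial(s, flags, c);
}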

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-07-22  1:03                                           ` Nishanth Aravamudan
@ 2014-07-22  1:16                                             ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-07-22  1:16 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Joonsoo Kim, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li

On Mon, 21 Jul 2014, Nishanth Aravamudan wrote:

> Sorry for bringing up this old thread again, but I had a question for
> you, David. node_to_mem_node(), which does seem like a useful API,
> doesn't seem like it can rely on node_distance() alone, right? Because
> that just tells us the relative cost (or so I think of it) of using
> resources from that node. But we also need to know if that node itself
> has memory, etc. So using the zonelists is required no matter what? And
> upon memory hotplug (or unplug), the topology can change in a way that
> affects things, so node online time isn't right either?
> 

I think there's two use cases of interest:

 - allocating from a memoryless node where numa_node_id() is memoryless, 
   and

 - using node_to_mem_node() for a possibly-memoryless node for kmalloc().

I believe the first should have its own node_zonelist[0], whether it's 
memoryless or not, that points to a list of zones that start with those 
with the smallest distance.  I think its own node_zonelist[1], for 
__GFP_THISNODE allocations, should point to the node with present memory 
that has the smallest distance.

For sure node_zonelist[0] cannot be NULL since things like 
first_online_pgdat() would break and it should be unnecessary to do 
node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES 
since the zonelists should already be defined properly.  All nodes, 
regardless of whether they have memory or not, should probably end up 
having a struct pglist_data unless there's a reason for another level of 
indirection.

^ permalink raw reply	[flat|nested] 229+ messages in thread
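
As background for the zonelist indices above: in kernels of this era the
choice between node_zonelists[0] (ordered fallback) and node_zonelists[1]
(used for __GFP_THISNODE) is made by two small helpers, condensed here
from include/linux/gfp.h:

static inline int gfp_zonelist(gfp_t flags)
{
	if (IS_ENABLED(CONFIG_NUMA) && unlikely(flags & __GFP_THISNODE))
		return 1;	/* node_zonelists[1]: this node only */
	return 0;		/* node_zonelists[0]: ordered fallback */
}

static inline struct zonelist *node_zonelist(int nid, gfp_t flags)
{
	return NODE_DATA(nid)->node_zonelists + gfp_zonelist(flags);
}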

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-07-22  1:16                                             ` David Rientjes
@ 2014-07-22 21:43                                               ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-07-22 21:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Joonsoo Kim, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li,
	Tejun Heo

Hi David,

On 21.07.2014 [18:16:58 -0700], David Rientjes wrote:
> On Mon, 21 Jul 2014, Nishanth Aravamudan wrote:
> 
> > Sorry for bringing up this old thread again, but I had a question for
> > you, David. node_to_mem_node(), which does seem like a useful API,
> > doesn't seem like it can rely on node_distance() alone, right? Because
> > that just tells us the relative cost (or so I think of it) of using
> > resources from that node. But we also need to know if that node itself
> > has memory, etc. So using the zonelists is required no matter what? And
> > upon memory hotplug (or unplug), the topology can change in a way that
> > affects things, so node online time isn't right either?
> > 
> 
> I think there's two use cases of interest:
> 
>  - allocating from a memoryless node where numa_node_id() is memoryless, 
>    and
> 
>  - using node_to_mem_node() for a possibly-memoryless node for kmalloc().
> 
> I believe the first should have its own node_zonelist[0], whether it's 
> memoryless or not, that points to a list of zones that start with those 
> with the smallest distance.

Ok, and that would be used for falling back in the appropriate priority?

> I think its own node_zonelist[1], for __GFP_THISNODE allocations,
> should point to the node with present memory that has the smallest
> distance.

And so would this, but with the caveat that we can fail here and not
go further? Semantically, __GFP_THISNODE then means "as close as
physically possible, ignoring run-time memory constraints". I say that
because obviously we might get off-node memory without memoryless nodes,
but that shouldn't be used to satisfy __GFP_THISNODE allocations.

> For sure node_zonelist[0] cannot be NULL since things like 
> first_online_pgdat() would break and it should be unnecessary to do 
> node_to_mem_node() for all allocations when CONFIG_HAVE_MEMORYLESS_NODES 
> since the zonelists should already be defined properly.  All nodes, 
> regardless of whether they have memory or not, should probably end up 
> having a struct pglist_data unless there's a reason for another level of 
> indirection.

So I've re-tested Joonsoo's patches 2 and 3 from the series he sent, and
on powerpc things now look really good. On a KVM instance with the
following topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 14274 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10 

3.16.0-rc6 gives:

        Slab:            1039744 kB
	SReclaimable:      38976 kB
	SUnreclaim:      1000768 kB

Joonsoo's patches give:

        Slab:             366144 kB
	SReclaimable:      36928 kB
	SUnreclaim:       329216 kB

For reference, CONFIG_SLAB gives:

        Slab:             122496 kB
	SReclaimable:      14912 kB
	SUnreclaim:       107584 kB

At Tejun's request [adding him to Cc], I also partially reverted
81c98869faa5 ("kthread: ensure locality of task_struct allocations"): 

	Slab:             428864 kB
	SReclaimable:      44288 kB
	SUnreclaim:       384576 kB

This seems slightly worse, but I think it's because of the same
root-cause that I indicated in my RFC patch 2/2, quoting it here:

"    There is an issue currently where NUMA information is used on powerpc
    (and possibly ia64) before it has been read from the device-tree, which
    leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
    
    NUMA powerpc non-boot CPUs' cpu_to_node/cpu_to_mem is only accurate
    after start_secondary(), similar to ia64, which is invoked via
    smp_init().
    
    Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
    early_initcall()") made init_workqueues() be invoked via
    do_pre_smp_initcalls(), which is obviously before the secondary
    processors are online.
    ...
    Therefore, when init_workqueues() runs, it sees all CPUs as being on
    Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
    a high number of slab deactivations
    (http://www.spinics.net/lists/linux-mm/msg67489.html)."

Christoph/Tejun, do you see the issue I'm referring to? Is my analysis
correct? It seems like regardless of CONFIG_USE_PERCPU_NUMA_NODE_ID, we
have to be especially careful that users of cpu_to_{node,mem} and
related APIs run *after* correct values are stored for all used CPUs?

In any case, with Joonsoo's patches, we shouldn't see slab deactivations
*if* the NUMA topology information is stored correctly. The full
changelog and patch are at http://patchwork.ozlabs.org/patch/371266/.

Adding my patch on top of Joonsoo's and the revert, I get:

	Slab:             411776 kB
	SReclaimable:      40960 kB
	SUnreclaim:       370816 kB

So CONFIG_SLUB still uses about 3x as much slab memory, but it's not so
much that we are close to OOM with small VM/LPAR sizes.

Thoughts?

I would like to push:

1) Joonsoo's patch to add get_numa_mem, renamed to node_to_mem_node(),
which caches the result of local_memory_node() for each node.

2) Joonsoo's patch to use node_to_mem_node in __slab_alloc() and
get_partial() when memoryless nodes are encountered.

3) Partial revert of 81c98869faa5 ("kthread: ensure locality of
task_struct allocations") to remove a reference to cpu_to_mem() from the
kthread code. After this, the only references to cpu_to_mem() are in
headers, mm/slab.c, and kernel/profile.c (the last of which is because
of the use of alloc_pages_exact_node(), it seems).

4) Re-post of my patch to fix an ordering issue for the per-CPU NUMA
information on powerpc

I understand your concerns, I think, about Joonsoo's patches, but we're
hitting this pretty regularly in the field and it would be nice to have
something workable in the short term, while I try to follow up on these
more invasive ideas.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 229+ messages in thread
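
A sketch of item 4 above, assuming (from the linked patchwork entry, not
the final commit) that the fix is to publish each CPU's node and
nearest-memory node while CPUs are being prepared, before early initcalls
such as init_workqueues() run, rather than waiting for start_secondary():

/* conceptually in arch/powerpc/kernel/smp.c */
void __init smp_prepare_cpus(unsigned int max_cpus)
{
	unsigned int cpu;

	for_each_possible_cpu(cpu) {
		int nid = numa_cpu_lookup_table[cpu];

		/*
		 * Publish the mapping now so that early users of
		 * cpu_to_node()/cpu_to_mem() no longer see every CPU
		 * on node 0.
		 */
		set_cpu_numa_node(cpu, nid);
		set_cpu_numa_mem(cpu, local_memory_node(nid));
	}
}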

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-07-22 21:43                                               ` Nishanth Aravamudan
@ 2014-07-22 21:49                                                 ` Tejun Heo
  -1 siblings, 0 replies; 229+ messages in thread
From: Tejun Heo @ 2014-07-22 21:49 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: David Rientjes, Joonsoo Kim, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li

Hello,

On Tue, Jul 22, 2014 at 02:43:11PM -0700, Nishanth Aravamudan wrote:
...
> "    There is an issue currently where NUMA information is used on powerpc
>     (and possibly ia64) before it has been read from the device-tree, which
>     leads to large slab consumption with CONFIG_SLUB and memoryless nodes.
>     
>     NUMA powerpc non-boot CPUs' cpu_to_node/cpu_to_mem is only accurate
>     after start_secondary(), similar to ia64, which is invoked via
>     smp_init().
>     
>     Commit 6ee0578b4daae ("workqueue: mark init_workqueues() as
>     early_initcall()") made init_workqueues() be invoked via
>     do_pre_smp_initcalls(), which is obviously before the secondary
>     processors are online.
>     ...
>     Therefore, when init_workqueues() runs, it sees all CPUs as being on
>     Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
>     a high number of slab deactivations
>     (http://www.spinics.net/lists/linux-mm/msg67489.html)."
> 
> Christoph/Tejun, do you see the issue I'm referring to? Is my analysis
> correct? It seems like regardless of CONFIG_USE_PERCPU_NUMA_NODE_ID, we
> have to be especially careful that users of cpu_to_{node,mem} and
> related APIs run *after* correct values are stored for all used CPUs?

Without delving into the code, yes, NUMA info should be set up as soon
as possible before major allocations happen.  All allocations which
happen beforehand would naturally be done with bogus NUMA information.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-07-22 21:43                                               ` Nishanth Aravamudan
@ 2014-07-22 23:47                                                 ` Nishanth Aravamudan
  -1 siblings, 0 replies; 229+ messages in thread
From: Nishanth Aravamudan @ 2014-07-22 23:47 UTC (permalink / raw)
  To: David Rientjes
  Cc: Joonsoo Kim, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li,
	Tejun Heo

On 22.07.2014 [14:43:11 -0700], Nishanth Aravamudan wrote:
> Hi David,

<snip>

> on powerpc now, things look really good. On a KVM instance with the
> following topology:
> 
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
> node 1 size: 16336 MB
> node 1 free: 14274 MB
> node distances:
> node   0   1 
>   0:  10  40 
>   1:  40  10 
> 
> 3.16.0-rc6 gives:
> 
>         Slab:            1039744 kB
> 	SReclaimable:      38976 kB
> 	SUnreclaim:      1000768 kB

<snip>

> Adding my patch on top of Joonsoo's and the revert, I get:
> 
> 	Slab:             411776 kB
> 	SReclaimable:      40960 kB
> 	SUnreclaim:       370816 kB
> 
> So CONFIG_SLUB still uses about 3x as much slab memory, but it's not so
> much that we are close to OOM with small VM/LPAR sizes.

Just to clarify/add one more datapoint, with a balanced topology:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 8154 MB
node 0 free: 8075 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 8181 MB
node 1 free: 7776 MB
node distances:
node   0   1 
  0:  10  40 
  1:  40  10

I see the following for my patch + Joonsoo's + the revert:

Slab:             495872 kB
SReclaimable:      46528 kB
SUnreclaim:       449344 kB

(These numbers fluctuate quite a bit, between 250M and 500M.) This
indicates that the memoryless-node slab consumption is now on par
with a populated topology. And both are still more than CONFIG_SLAB
requires.

Thanks,
Nish

^ permalink raw reply	[flat|nested] 229+ messages in thread

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
  2014-07-22 21:43                                               ` Nishanth Aravamudan
@ 2014-07-23  0:43                                                 ` David Rientjes
  -1 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-07-23  0:43 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Joonsoo Kim, Han Pingtian, Pekka Enberg,
	Linux Memory Management List, Paul Mackerras, Anton Blanchard,
	Matt Mackall, Christoph Lameter, linuxppc-dev, Wanpeng Li,
	Tejun Heo

On Tue, 22 Jul 2014, Nishanth Aravamudan wrote:

> > I think there's two use cases of interest:
> > 
> >  - allocating from a memoryless node where numa_node_id() is memoryless, 
> >    and
> > 
> >  - using node_to_mem_node() for a possibly-memoryless node for kmalloc().
> > 
> > I believe the first should have its own node_zonelist[0], whether it's 
> > memoryless or not, that points to a list of zones that start with those 
> > with the smallest distance.
> 
> Ok, and that would be used for falling back in the appropriate priority?
> 

There's no real fallback since there's never a case when you can allocate 
on a memoryless node.  The zonelist defines the appropriate order in which 
to try to allocate from zones, so it depends on things like the 
numa_node_id() in alloc_pages_current() and whether the zonelist for a 
memoryless node is properly initialized or whether this needs to be 
numa_mem_id().  It depends on the intended behavior of calling
alloc_pages_{node,vma}() with a memoryless node; the complexity of
(re-)building the zonelists at bootstrap and for memory hotplug isn't a
hotpath concern.

This choice would also impact MPOL_PREFERRED mempolicies when MPOL_F_LOCAL 
is set.

> > I think its own node_zonelist[1], for __GFP_THISNODE allocations,
> > should point to the node with present memory that has the smallest
> > distance.
> 
> And so would this, but with the caveat that we can fail here and not
> go further? Semantically, __GFP_THISNODE then means "as close as
> physically possible, ignoring run-time memory constraints". I say that
> because obviously we might get off-node memory without memoryless nodes,
> but that shouldn't be used to satisfy __GFP_THISNODE allocations.
> 

alloc_pages_current() substitutes any existing mempolicy for the default 
local policy when __GFP_THISNODE is set, and that would require local 
allocation.  That, currently, is numa_node_id() and not numa_mem_id().

The slab allocator already only uses __GFP_THISNODE for numa_mem_id() so 
it will allocate remotely anyway.

^ permalink raw reply	[flat|nested] 229+ messages in thread
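
For reference, the substitution described above happens in
mm/mempolicy.c; a condensed sketch of the era's alloc_pages_current(),
not verbatim:

struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
	struct mempolicy *pol = get_task_policy(current);

	/*
	 * __GFP_THISNODE (like interrupt context) overrides any task
	 * mempolicy with the default local policy, which is keyed on
	 * numa_node_id(), not numa_mem_id().
	 */
	if (!pol || in_interrupt() || (gfp & __GFP_THISNODE))
		pol = &default_policy;

	/* MPOL_INTERLEAVE handling elided */
	return __alloc_pages_nodemask(gfp, order,
			policy_zonelist(gfp, pol, numa_node_id()),
			policy_nodemask(gfp, pol));
}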

* Re: [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node
@ 2014-07-23  0:43                                                 ` David Rientjes
  0 siblings, 0 replies; 229+ messages in thread
From: David Rientjes @ 2014-07-23  0:43 UTC (permalink / raw)
  To: Nishanth Aravamudan
  Cc: Han Pingtian, Pekka Enberg, Linux Memory Management List,
	Paul Mackerras, Anton Blanchard, Matt Mackall, Tejun Heo,
	Joonsoo Kim, linuxppc-dev, Christoph Lameter, Wanpeng Li

On Tue, 22 Jul 2014, Nishanth Aravamudan wrote:

> > I think there's two use cases of interest:
> > 
> >  - allocating from a memoryless node where numa_node_id() is memoryless, 
> >    and
> > 
> >  - using node_to_mem_node() for a possibly-memoryless node for kmalloc().
> > 
> > I believe the first should have its own node_zonelist[0], whether it's 
> > memoryless or not, that points to a list of zones that start with those 
> > with the smallest distance.
> 
> Ok, and that would be used for falling back in the appropriate priority?
> 

There's no real fallback since there's never a case when you can allocate 
on a memoryless node.  The zonelist defines the appropriate order in which 
to try to allocate from zones, so it depends on things like the 
numa_node_id() in alloc_pages_current() and whether the zonelist for a 
memoryless node is properly initialized or whether this needs to be 
numa_mem_id().  It depends on the intended behavior of calling 
alloc_pages_{node,vma}() with a memoryless node, the complexity of 
(re-)building the zonelists at bootstrap and for memory hotplug isn't a 
hotpath.

This choice would also impact MPOL_PREFERRED mempolicies when MPOL_F_LOCAL 
is set.

> > I think its own node_zonelist[1], for __GFP_THISNODE allocations,
> > should point to the node with present memory that has the smallest
> > distance.
> 
> And so would this, but with the caveat that we can fail here and don't
> go further? Semantically, __GFP_THISNODE then means "as close as
> physically possible ignoring run-time memory constraints". I say that
> because obviously we might get off-node memory without memoryless nodes,
> but that shouldn't be used to satisfy __GPF_THISNODE allocations.
> 

alloc_pages_current() substitutes any existing mempolicy for the default 
local policy when __GFP_THISNODE is set, and that would require local 
allocation.  That, currently, is numa_node_id() and not numa_mem_id().

The slab allocator already only uses __GFP_THISNODE for numa_mem_id() so 
it will allocate remotely anyway.

^ permalink raw reply	[flat|nested] 229+ messages in thread

end of thread, other threads:[~2014-07-23  0:43 UTC | newest]

Thread overview: 229+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-01-07  2:21 [PATCH] slub: Don't throw away partial remote slabs if there is no local memory Anton Blanchard
2014-01-07  2:21 ` Anton Blanchard
2014-01-07  4:19 ` Wanpeng Li
2014-01-07  4:19 ` Wanpeng Li
2014-01-07  4:19 ` Wanpeng Li
2014-01-08 14:17   ` Anton Blanchard
2014-01-08 14:17     ` Anton Blanchard
2014-01-07  4:19 ` Wanpeng Li
2014-01-07  6:49 ` Andi Kleen
2014-01-07  6:49   ` Andi Kleen
2014-01-08 14:03   ` Anton Blanchard
2014-01-08 14:03     ` Anton Blanchard
2014-01-07  7:41 ` Joonsoo Kim
2014-01-07  7:41   ` Joonsoo Kim
2014-01-07  8:48   ` Wanpeng Li
2014-01-07  8:48   ` Wanpeng Li
2014-01-07  8:48   ` Wanpeng Li
2014-01-07  8:48   ` Wanpeng Li
2014-01-07  9:10     ` Joonsoo Kim
2014-01-07  9:10       ` Joonsoo Kim
2014-01-07  9:21       ` Wanpeng Li
2014-01-07  9:21       ` Wanpeng Li
2014-01-07  9:31         ` Joonsoo Kim
2014-01-07  9:31           ` Joonsoo Kim
2014-01-07  9:49           ` Wanpeng Li
2014-01-07  9:49           ` Wanpeng Li
2014-01-07  9:49           ` Wanpeng Li
2014-01-07  9:49           ` Wanpeng Li
2014-01-07  9:21       ` Wanpeng Li
2014-01-07  9:21       ` Wanpeng Li
2014-01-07  9:52   ` Wanpeng Li
2014-01-07  9:52   ` Wanpeng Li
2014-01-07  9:52   ` Wanpeng Li
2014-01-09  0:20     ` Joonsoo Kim
2014-01-09  0:20       ` Joonsoo Kim
2014-01-07  9:52   ` Wanpeng Li
2014-01-20  9:10   ` Wanpeng Li
2014-01-20  9:10   ` Wanpeng Li
2014-01-20  9:10   ` Wanpeng Li
2014-01-20  9:10   ` Wanpeng Li
     [not found]   ` <52dce7fe.e5e6420a.5ff6.ffff84a0SMTPIN_ADDED_BROKEN@mx.google.com>
2014-01-20 22:13     ` Christoph Lameter
2014-01-20 22:13       ` Christoph Lameter
2014-01-21  2:20       ` Wanpeng Li
2014-01-21  2:20       ` Wanpeng Li
2014-01-21  2:20       ` Wanpeng Li
2014-01-21  2:20       ` Wanpeng Li
2014-01-24  3:09       ` Wanpeng Li
2014-01-24  3:09       ` Wanpeng Li
2014-01-24  3:09       ` Wanpeng Li
2014-01-24  3:09       ` Wanpeng Li
2014-01-24  3:14         ` Wanpeng Li
2014-01-24  3:14         ` Wanpeng Li
2014-01-24  3:14         ` Wanpeng Li
2014-01-24  3:14         ` Wanpeng Li
     [not found]         ` <52e1da8f.86f7440a.120f.25f3SMTPIN_ADDED_BROKEN@mx.google.com>
2014-01-24 15:50           ` Christoph Lameter
2014-01-24 15:50             ` Christoph Lameter
2014-01-24 21:03             ` David Rientjes
2014-01-24 21:03               ` David Rientjes
2014-01-24 22:19               ` Nishanth Aravamudan
2014-01-24 22:19                 ` Nishanth Aravamudan
2014-01-24 23:29               ` Nishanth Aravamudan
2014-01-24 23:29                 ` Nishanth Aravamudan
2014-01-24 23:49                 ` David Rientjes
2014-01-24 23:49                   ` David Rientjes
2014-01-25  0:16                   ` Nishanth Aravamudan
2014-01-25  0:16                     ` Nishanth Aravamudan
2014-01-25  0:25                     ` David Rientjes
2014-01-25  0:25                       ` David Rientjes
2014-01-25  1:10                       ` Nishanth Aravamudan
2014-01-25  1:10                         ` Nishanth Aravamudan
2014-01-27  5:58                         ` Joonsoo Kim
2014-01-27  5:58                           ` Joonsoo Kim
2014-01-28 18:29                           ` Nishanth Aravamudan
2014-01-28 18:29                             ` Nishanth Aravamudan
2014-01-29 15:54                             ` Christoph Lameter
2014-01-29 15:54                               ` Christoph Lameter
2014-01-29 22:36                             ` Nishanth Aravamudan
2014-01-29 22:36                               ` Nishanth Aravamudan
2014-01-30 16:26                               ` Christoph Lameter
2014-01-30 16:26                                 ` Christoph Lameter
2014-02-03 23:00                             ` Nishanth Aravamudan
2014-02-03 23:00                               ` Nishanth Aravamudan
2014-02-04  3:38                               ` Christoph Lameter
2014-02-04  3:38                                 ` Christoph Lameter
2014-02-04  7:26                                 ` Nishanth Aravamudan
2014-02-04  7:26                                   ` Nishanth Aravamudan
2014-02-04 20:39                                   ` Christoph Lameter
2014-02-04 20:39                                     ` Christoph Lameter
2014-02-05  0:13                                     ` Nishanth Aravamudan
2014-02-05  0:13                                       ` Nishanth Aravamudan
2014-02-05 19:28                                       ` Christoph Lameter
2014-02-05 19:28                                         ` Christoph Lameter
2014-02-06  2:08                                         ` Nishanth Aravamudan
2014-02-06  2:08                                           ` Nishanth Aravamudan
2014-02-06 17:25                                           ` Christoph Lameter
2014-02-06 17:25                                             ` Christoph Lameter
2014-01-27 16:18                         ` Christoph Lameter
2014-01-27 16:18                           ` Christoph Lameter
2014-02-06  2:07                       ` Nishanth Aravamudan
2014-02-06  2:07                         ` Nishanth Aravamudan
2014-02-06  8:04                         ` Joonsoo Kim
2014-02-06  8:04                           ` Joonsoo Kim
     [not found]                           ` <20140206185955.GA7845@linux.vnet.ibm.com>
2014-02-06 19:28                             ` Nishanth Aravamudan
2014-02-06 19:28                               ` Nishanth Aravamudan
2014-02-07  8:03                               ` Joonsoo Kim
2014-02-07  8:03                                 ` Joonsoo Kim
2014-02-06  8:07                         ` [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id() Joonsoo Kim
2014-02-06  8:07                           ` Joonsoo Kim
2014-02-06  8:07                           ` [RFC PATCH 2/3] topology: support node_numa_mem() for determining the fallback node Joonsoo Kim
2014-02-06  8:07                             ` Joonsoo Kim
2014-02-06  8:52                             ` David Rientjes
2014-02-06  8:52                               ` David Rientjes
2014-02-06 10:29                               ` Joonsoo Kim
2014-02-06 10:29                                 ` Joonsoo Kim
2014-02-06 19:11                                 ` Nishanth Aravamudan
2014-02-06 19:11                                   ` Nishanth Aravamudan
2014-02-07  5:42                                   ` Joonsoo Kim
2014-02-07  5:42                                     ` Joonsoo Kim
2014-02-06 20:52                                 ` David Rientjes
2014-02-06 20:52                                   ` David Rientjes
2014-02-07  5:48                                   ` Joonsoo Kim
2014-02-07  5:48                                     ` Joonsoo Kim
2014-02-07 17:53                                     ` Christoph Lameter
2014-02-07 17:53                                       ` Christoph Lameter
2014-02-07 18:51                                       ` Christoph Lameter
2014-02-07 18:51                                         ` Christoph Lameter
2014-02-07 21:38                                         ` Nishanth Aravamudan
2014-02-07 21:38                                           ` Nishanth Aravamudan
2014-02-10  1:15                                           ` Joonsoo Kim
2014-02-10  1:15                                             ` Joonsoo Kim
2014-02-10  1:29                                         ` Joonsoo Kim
2014-02-10  1:29                                           ` Joonsoo Kim
2014-02-11 18:45                                           ` Christoph Lameter
2014-02-11 18:45                                             ` Christoph Lameter
2014-02-10 19:13                                         ` Nishanth Aravamudan
2014-02-10 19:13                                           ` Nishanth Aravamudan
2014-02-11  7:42                                           ` Joonsoo Kim
2014-02-11  7:42                                             ` Joonsoo Kim
2014-02-12 22:16                                             ` Christoph Lameter
2014-02-12 22:16                                               ` Christoph Lameter
2014-02-13  3:53                                               ` Nishanth Aravamudan
2014-02-13  3:53                                                 ` Nishanth Aravamudan
2014-02-17  6:52                                               ` Joonsoo Kim
2014-02-17  6:52                                                 ` Joonsoo Kim
2014-02-18 16:38                                                 ` Christoph Lameter
2014-02-18 16:38                                                   ` Christoph Lameter
2014-02-19 22:04                                                   ` David Rientjes
2014-02-19 22:04                                                     ` David Rientjes
2014-02-20 16:02                                                     ` Christoph Lameter
2014-02-20 16:02                                                       ` Christoph Lameter
2014-02-24  5:08                                                   ` Joonsoo Kim
2014-02-24  5:08                                                     ` Joonsoo Kim
2014-02-24 19:54                                                     ` Christoph Lameter
2014-02-24 19:54                                                       ` Christoph Lameter
2014-03-13 16:51                                                       ` Nishanth Aravamudan
2014-03-13 16:51                                                         ` Nishanth Aravamudan
2014-02-18 17:22                                               ` Nishanth Aravamudan
2014-02-18 17:22                                                 ` Nishanth Aravamudan
2014-02-13  6:51                                             ` Nishanth Aravamudan
2014-02-13  6:51                                               ` Nishanth Aravamudan
2014-02-17  7:00                                               ` Joonsoo Kim
2014-02-17  7:00                                                 ` Joonsoo Kim
2014-02-18 16:57                                                 ` Christoph Lameter
2014-02-18 16:57                                                   ` Christoph Lameter
2014-02-18 17:28                                                   ` Nishanth Aravamudan
2014-02-18 17:28                                                     ` Nishanth Aravamudan
2014-02-18 19:58                                                     ` Christoph Lameter
2014-02-18 19:58                                                       ` Christoph Lameter
2014-02-18 21:09                                                       ` Nishanth Aravamudan
2014-02-18 21:09                                                         ` Nishanth Aravamudan
2014-02-18 21:49                                                         ` Christoph Lameter
2014-02-18 21:49                                                           ` Christoph Lameter
2014-02-18 22:22                                                           ` Nishanth Aravamudan
2014-02-18 22:22                                                             ` Nishanth Aravamudan
2014-02-19 16:11                                                             ` Christoph Lameter
2014-02-19 16:11                                                               ` Christoph Lameter
2014-02-19 22:03                                                       ` David Rientjes
2014-02-19 22:03                                                         ` David Rientjes
2014-02-08  9:57                                     ` David Rientjes
2014-02-10  1:09                                       ` Joonsoo Kim
2014-07-22  1:03                                         ` Nishanth Aravamudan
2014-07-22  1:16                                           ` David Rientjes
2014-07-22 21:43                                             ` Nishanth Aravamudan
2014-07-22 21:49                                               ` Tejun Heo
2014-07-22 23:47                                               ` Nishanth Aravamudan
2014-07-23  0:43                                               ` David Rientjes
2014-02-06  8:07                           ` [RFC PATCH 3/3] slub: fallback to get_numa_mem() node if we want to allocate on memoryless node Joonsoo Kim
2014-02-06 17:30                             ` Christoph Lameter
2014-02-07  5:41                               ` Joonsoo Kim
2014-02-07 17:49                                 ` Christoph Lameter
2014-02-10  1:22                                   ` Joonsoo Kim
2014-02-06  8:37                           ` [RFC PATCH 1/3] slub: search partial list on numa_mem_id(), instead of numa_node_id() David Rientjes
2014-02-06 17:31                             ` Christoph Lameter
2014-02-06 17:26                           ` Christoph Lameter
2014-05-16 23:37                           ` Nishanth Aravamudan
2014-05-19  2:41                             ` Joonsoo Kim
2014-06-05  0:13                           ` [RESEND PATCH] " David Rientjes
2014-01-27 16:24                     ` [PATCH] slub: Don't throw away partial remote slabs if there is no local memory Christoph Lameter
2014-01-27 16:16                   ` Christoph Lameter
2014-01-07  9:42 ` David Laight
2014-01-08 14:14   ` Anton Blanchard
2014-01-07 10:28 ` Wanpeng Li
