* [PATCH 0/3] Fix SLQB on memoryless configurations V3
@ 2009-09-22 12:54 ` Mel Gorman
  0 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-09-22 12:54 UTC (permalink / raw)
  To: Nick Piggin, Pekka Enberg, Christoph Lameter
  Cc: heiko.carstens, sachinp, linux-kernel, linux-mm, Mel Gorman,
	Tejun Heo, Benjamin Herrenschmidt

Changelog since V2
  o Turned out that allocating per-cpu areas for node ids on ppc64 just
    wasn't stable. This series statically declares the per-node data. This
    wastes memory but it appears to work.

Currently SLQB is not allowed to be configured on PPC and S390 machines as
CPUs can belong to memoryless nodes. SLQB does not deal with this very well
and crashes reliably.

These patches partially fix the memoryless node problem for SLQB. The
machine will boot successfully but is unstable under stress, indicating
that SLQB still has serious problems when dealing with pages from remote
nodes. The remote-node instability may be linked to the per-cpu stability
problem, but for now the two should be treated as separate bugs.

Patch 1 statically defines some per-node structures instead of using a fun
        hack with DEFINE_PER_CPU. The per-node areas were not always being
        initialised by the architecture, which led to a crash.

Patch 2 notes that on memoryless configurations, memory is always freed
	to remote lists while allocation always checks the local lists and
	falls back to the page allocator on failure. This is effectively a
	memory leak. The patch records in kmem_cache_cpu which node it
	considers local: either the real local node or the closest node
	available.

Patch 3 allows SLQB to be configured again on PPC and S390. These patches
	address most of the memoryless node issues on PPC and the expectation
	is that the remaining bugs in SLQB are related to remote nodes,
	per-cpu area allocation or both. The patch also enables SLQB on S390,
	as Heiko Carstens has reported that the issues there have been
	independently resolved.

I believe these are ready for merging although it would be preferred if
Nick signed off on them. Christoph has suggested that SLQB should be
disabled for NUMA, but I feel that if it is disabled, the problem may never
be resolved. Hence I didn't patch accordingly, but Pekka or Nick may feel
differently.

 include/linux/slqb_def.h |    3 ++
 init/Kconfig             |    1 -
 mm/slqb.c                |   52 ++++++++++++++++++++++++++++-----------------
 3 files changed, 35 insertions(+), 21 deletions(-)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 1/4] slqb: Do not use DEFINE_PER_CPU for per-node data
  2009-09-22 12:54 ` Mel Gorman
@ 2009-09-22 12:54   ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-09-22 12:54 UTC (permalink / raw)
  To: Nick Piggin, Pekka Enberg, Christoph Lameter
  Cc: heiko.carstens, sachinp, linux-kernel, linux-mm, Mel Gorman,
	Tejun Heo, Benjamin Herrenschmidt

SLQB uses DEFINE_PER_CPU to define per-node areas. An implicit
assumption is made that all valid node IDs will have matching valid CPU
ids. In memoryless configurations, it is possible to have a node ID with
no CPU having the same ID. When this happens, per-cpu areas are not
initialised and the per-node data is effectively random.

An attempt was made to force the allocation of per-cpu areas corresponding
to active node IDs. However, for reasons unknown this led to silent
lockups. Instead, this patch fixes the SLQB problem by forcing the per-node
data to be statically declared.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 mm/slqb.c |   16 ++++++++--------
 1 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/slqb.c b/mm/slqb.c
index 4ca85e2..4d72be2 100644
--- a/mm/slqb.c
+++ b/mm/slqb.c
@@ -1944,16 +1944,16 @@ static void init_kmem_cache_node(struct kmem_cache *s,
 static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cache_cpus);
 #endif
 #ifdef CONFIG_NUMA
-/* XXX: really need a DEFINE_PER_NODE for per-node data, but this is better than
- * a static array */
-static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cache_nodes);
+/* XXX: really need a DEFINE_PER_NODE for per-node data because a static
+ *      array is wasteful */
+static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
 #endif
 
 #ifdef CONFIG_SMP
 static struct kmem_cache kmem_cpu_cache;
 static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_cpu_cpus);
 #ifdef CONFIG_NUMA
-static DEFINE_PER_CPU(struct kmem_cache_node, kmem_cpu_nodes); /* XXX per-nid */
+static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES]; /* XXX per-nid */
 #endif
 #endif
 
@@ -1962,7 +1962,7 @@ static struct kmem_cache kmem_node_cache;
 #ifdef CONFIG_SMP
 static DEFINE_PER_CPU(struct kmem_cache_cpu, kmem_node_cpus);
 #endif
-static DEFINE_PER_CPU(struct kmem_cache_node, kmem_node_nodes); /*XXX per-nid */
+static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES]; /*XXX per-nid */
 #endif
 
 #ifdef CONFIG_SMP
@@ -2918,15 +2918,15 @@ void __init kmem_cache_init(void)
 	for_each_node_state(i, N_NORMAL_MEMORY) {
 		struct kmem_cache_node *n;
 
-		n = &per_cpu(kmem_cache_nodes, i);
+		n = &kmem_cache_nodes[i];
 		init_kmem_cache_node(&kmem_cache_cache, n);
 		kmem_cache_cache.node_slab[i] = n;
 #ifdef CONFIG_SMP
-		n = &per_cpu(kmem_cpu_nodes, i);
+		n = &kmem_cpu_nodes[i];
 		init_kmem_cache_node(&kmem_cpu_cache, n);
 		kmem_cpu_cache.node_slab[i] = n;
 #endif
-		n = &per_cpu(kmem_node_nodes, i);
+		n = &kmem_node_nodes[i];
 		init_kmem_cache_node(&kmem_node_cache, n);
 		kmem_node_cache.node_slab[i] = n;
 	}
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-22 12:54 ` Mel Gorman
@ 2009-09-22 12:54   ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-09-22 12:54 UTC (permalink / raw)
  To: Nick Piggin, Pekka Enberg, Christoph Lameter
  Cc: heiko.carstens, sachinp, linux-kernel, linux-mm, Mel Gorman,
	Tejun Heo, Benjamin Herrenschmidt

When freeing a page, SLQB checks if the page belongs to the local node.
If it is not, it is considered a remote free. On the allocation side, it
always checks the local lists and if they are empty, the page allocator
is called. On memoryless configurations, this is effectively a memory
leak and the machine quickly kills itself in an OOM storm.

This patch records what node ID is considered local to a CPU. As the
management structure for the CPU is always allocated from the closest
node, the node the CPU structure resides on is considered "local".

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/linux/slqb_def.h |    3 +++
 mm/slqb.c                |   23 +++++++++++++++++------
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/linux/slqb_def.h b/include/linux/slqb_def.h
index 1243dda..2ccbe7e 100644
--- a/include/linux/slqb_def.h
+++ b/include/linux/slqb_def.h
@@ -101,6 +101,9 @@ struct kmem_cache_cpu {
 	struct kmem_cache_list	list;		/* List for node-local slabs */
 	unsigned int		colour_next;	/* Next colour offset to use */
 
+	/* local_nid will be numa_node_id() except when memoryless */
+	unsigned int		local_nid;
+
 #ifdef CONFIG_SMP
 	/*
 	 * rlist is a list of objects that don't fit on list.freelist (ie.
diff --git a/mm/slqb.c b/mm/slqb.c
index 4d72be2..89fd8e4 100644
--- a/mm/slqb.c
+++ b/mm/slqb.c
@@ -1375,7 +1375,7 @@ static noinline void *__slab_alloc_page(struct kmem_cache *s,
 	if (unlikely(!page))
 		return page;
 
-	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == numa_node_id())) {
+	if (!NUMA_BUILD || likely(slqb_page_to_nid(page) == c->local_nid)) {
 		struct kmem_cache_cpu *c;
 		int cpu = smp_processor_id();
 
@@ -1501,15 +1501,16 @@ static __always_inline void *__slab_alloc(struct kmem_cache *s,
 	struct kmem_cache_cpu *c;
 	struct kmem_cache_list *l;
 
+	c = get_cpu_slab(s, smp_processor_id());
+	VM_BUG_ON(!c);
+
 #ifdef CONFIG_NUMA
-	if (unlikely(node != -1) && unlikely(node != numa_node_id())) {
+	if (unlikely(node != -1) && unlikely(node != c->local_nid)) {
 try_remote:
 		return __remote_slab_alloc(s, gfpflags, node);
 	}
 #endif
 
-	c = get_cpu_slab(s, smp_processor_id());
-	VM_BUG_ON(!c);
 	l = &c->list;
 	object = __cache_list_get_object(s, l);
 	if (unlikely(!object)) {
@@ -1518,7 +1519,7 @@ try_remote:
 			object = __slab_alloc_page(s, gfpflags, node);
 #ifdef CONFIG_NUMA
 			if (unlikely(!object)) {
-				node = numa_node_id();
+				node = c->local_nid;
 				goto try_remote;
 			}
 #endif
@@ -1733,7 +1734,7 @@ static __always_inline void __slab_free(struct kmem_cache *s,
 	slqb_stat_inc(l, FREE);
 
 	if (!NUMA_BUILD || !slab_numa(s) ||
-			likely(slqb_page_to_nid(page) == numa_node_id())) {
+			likely(slqb_page_to_nid(page) == c->local_nid)) {
 		/*
 		 * Freeing fastpath. Collects all local-node objects, not
 		 * just those allocated from our per-CPU list. This allows
@@ -1928,6 +1929,16 @@ static void init_kmem_cache_cpu(struct kmem_cache *s,
 	c->rlist.tail		= NULL;
 	c->remote_cache_list	= NULL;
 #endif
+
+	/*
+	 * Determine what the local node to this CPU is. Ordinarily
+	 * this would be cpu_to_node() but for memoryless nodes, that
+	 * is not the best value. Instead, we take the numa node that
+	 * kmem_cache_cpu is allocated from as being the best guess
+	 * as being local because it'll match what the page allocator
+	 * thinks is the most local
+	 */
+	c->local_nid = page_to_nid(virt_to_page((unsigned long)c & PAGE_MASK));
 }
 
 #ifdef CONFIG_NUMA
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH 3/4] slqb: Allow SLQB to be used on PPC and S390
  2009-09-22 12:54 ` Mel Gorman
@ 2009-09-22 12:54   ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-09-22 12:54 UTC (permalink / raw)
  To: Nick Piggin, Pekka Enberg, Christoph Lameter
  Cc: heiko.carstens, sachinp, linux-kernel, linux-mm, Mel Gorman,
	Tejun Heo, Benjamin Herrenschmidt

SLQB was disabled on PPC as it would stab itself in the face when running
on machines with CPUs on memoryless nodes, and was disabled on S390 due
to other functional difficulties. S390 has been independently fixed and
PPC should work in most configurations, although locking of remote nodes
still presents some difficulties. Allow SLQB to be configured again so the
dodgy configurations can be further identified and debugged.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 init/Kconfig |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index adc10ab..c56248f 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1033,7 +1033,6 @@ config SLUB
 
 config SLQB
 	bool "SLQB (Queued allocator)"
-	depends on !PPC && !S390
 	help
 	  SLQB is a proposed new slab allocator.
 
-- 
1.6.3.3


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 0/3] Fix SLQB on memoryless configurations V3
  2009-09-22 12:54 ` Mel Gorman
@ 2009-09-22 13:21   ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-09-22 13:21 UTC (permalink / raw)
  To: Nick Piggin, Pekka Enberg, Christoph Lameter
  Cc: heiko.carstens, sachinp, linux-kernel, linux-mm, Tejun Heo,
	Benjamin Herrenschmidt

On Tue, Sep 22, 2009 at 01:54:11PM +0100, Mel Gorman wrote:
> Changelog since V2
>   o Turned out that allocating per-cpu areas for node ids on ppc64 just
>     wasn't stable. This series statically declares the per-node data. This
>     wastes memory but it appears to work.
> 
> Currently SLQB is not allowed to be configured on PPC and S390 machines as
> CPUs can belong to memoryless nodes. SLQB does not deal with this very well
> and crashes reliably.
> 

GACK. Sorry about the 1/4, 2/4, 3/4 problem. There are only three
patches in this set. I dropped the last patch, which was related to the
SLQB corruption problem, because it didn't appear to help, but I didn't
fix up the numbering. Sorry.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-22 12:54   ` Mel Gorman
@ 2009-09-22 13:38     ` Pekka Enberg
  -1 siblings, 0 replies; 42+ messages in thread
From: Pekka Enberg @ 2009-09-22 13:38 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Christoph Lameter, heiko.carstens, sachinp,
	linux-kernel, linux-mm, Tejun Heo, Benjamin Herrenschmidt

Hi Mel,

On Tue, Sep 22, 2009 at 3:54 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> When freeing a page, SLQB checks if the page belongs to the local node.
> If it is not, it is considered a remote free. On the allocation side, it
> always checks the local lists and if they are empty, the page allocator
> is called. On memoryless configurations, this is effectively a memory
> leak and the machine quickly kills itself in an OOM storm.
>
> This patch records what node ID is considered local to a CPU. As the
> management structure for the CPU is always allocated from the closest
> node, the node the CPU structure resides on is considered "local".
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

I don't understand how the memory leak happens from the above
description (or reading the code). page_to_nid() returns some crazy
value at free time? The remote list isn't drained properly?

                        Pekka

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-22 13:38     ` Pekka Enberg
@ 2009-09-22 13:54       ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-09-22 13:54 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Christoph Lameter, heiko.carstens, sachinp,
	linux-kernel, linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Tue, Sep 22, 2009 at 04:38:32PM +0300, Pekka Enberg wrote:
> Hi Mel,
> 
> On Tue, Sep 22, 2009 at 3:54 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > When freeing a page, SLQB checks if the page belongs to the local node.
> > If it is not, it is considered a remote free. On the allocation side, it
> > always checks the local lists and if they are empty, the page allocator
> > is called. On memoryless configurations, this is effectively a memory
> > leak and the machine quickly kills itself in an OOM storm.
> >
> > This patch records what node ID is considered local to a CPU. As the
> > management structure for the CPU is always allocated from the closest
> > node, the node the CPU structure resides on is considered "local".
> >
> > Signed-off-by: Mel Gorman <mel@csn.ul.ie>
> 
> I don't understand how the memory leak happens from the above
> description (or reading the code). page_to_nid() returns some crazy
> value at free time?

Nope, it isn't a leak as such, the allocator knows where the memory is.
The problem is that is always frees remote but on allocation, it sees
the per-cpu list is empty and calls the page allocator again. The remote
lists just grow.
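
As a rough sketch of that cycle on a CPU whose node is memoryless (the
helper names below are taken from the patches in this thread, except
free_to_remote_list() which is made up for illustration):

	/* alloc side: the local list is empty here, so a fresh page is
	 * taken from the page allocator, necessarily on a remote node */
	object = __cache_list_get_object(s, &c->list);	/* empty here */
	if (!object)
		object = __slab_alloc_page(s, gfpflags, node);

	/* free side: the page never matches numa_node_id(), so the object
	 * lands on a remote list that the alloc path above never checks */
	if (slqb_page_to_nid(page) != numa_node_id())
		free_to_remote_list(s, c, object);	/* illustrative only */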

> The remote list isn't drained properly?
> 

That is another way of looking at it. When the remote lists get to a
watermark, they should drain. However, it's worth pointing out if it's
repaired in this fashion, the performance of SLQB will suffer as it'll
never reuse the local list of pages and instead always get cold pages
from the allocator.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-22 13:54       ` Mel Gorman
@ 2009-09-22 18:54         ` Pekka Enberg
  -1 siblings, 0 replies; 42+ messages in thread
From: Pekka Enberg @ 2009-09-22 18:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Christoph Lameter, heiko.carstens, sachinp,
	linux-kernel, linux-mm, Tejun Heo, Benjamin Herrenschmidt

Hi Mel,

On Tue, Sep 22, 2009 at 4:54 PM, Mel Gorman <mel@csn.ul.ie> wrote:
>> I don't understand how the memory leak happens from the above
>> description (or reading the code). page_to_nid() returns some crazy
>> value at free time?
>
> Nope, it isn't a leak as such, the allocator knows where the memory is.
> The problem is that is always frees remote but on allocation, it sees
> the per-cpu list is empty and calls the page allocator again. The remote
> lists just grow.
>
>> The remote list isn't drained properly?
>
> That is another way of looking at it. When the remote lists get to a
> watermark, they should drain. However, it's worth pointing out if it's
> repaired in this fashion, the performance of SLQB will suffer as it'll
> never reuse the local list of pages and instead always get cold pages
> from the allocator.

I worry about setting c->local_nid to the node of the allocated struct
kmem_cache_cpu. It seems like an arbitrary policy decision that's not
necessarily the best option and I'm not totally convinced it's correct
when cpusets are configured. SLUB seems to do the sane thing here by
using page allocator fallback (which respects cpusets AFAICT) and
recycling one slab slab at a time.

Can I persuade you into sending me a patch that fixes remote list
draining to get things working on PPC? I'd much rather wait for Nick's
input on the allocation policy and performance.

                        Pekka

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 1/4] slqb: Do not use DEFINE_PER_CPU for per-node data
  2009-09-22 12:54   ` Mel Gorman
@ 2009-09-22 18:55     ` Pekka Enberg
  -1 siblings, 0 replies; 42+ messages in thread
From: Pekka Enberg @ 2009-09-22 18:55 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Nick Piggin, Christoph Lameter, heiko.carstens, sachinp,
	linux-kernel, linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Tue, Sep 22, 2009 at 3:54 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> SLQB uses DEFINE_PER_CPU to define per-node areas. An implicit
> assumption is made that all valid node IDs will have matching valid CPU
> ids. In memoryless configurations, it is possible to have a node ID with
> no CPU having the same ID. When this happens, per-cpu areas are not
> initialised and the per-node data is effectively random.
>
> An attempt was made to force the allocation of per-cpu areas corresponding
> to active node IDs. However, for reasons unknown this led to silent
> lockups. Instead, this patch fixes the SLQB problem by forcing the per-node
> data to be statically declared.
>
> Signed-off-by: Mel Gorman <mel@csn.ul.ie>

Applied, thanks!

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-22 18:54         ` Pekka Enberg
@ 2009-09-22 18:56           ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-09-22 18:56 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Christoph Lameter, heiko.carstens, sachinp,
	linux-kernel, linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Tue, Sep 22, 2009 at 09:54:33PM +0300, Pekka Enberg wrote:
> Hi Mel,
> 
> On Tue, Sep 22, 2009 at 4:54 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> >> I don't understand how the memory leak happens from the above
> >> description (or reading the code). page_to_nid() returns some crazy
> >> value at free time?
> >
> > Nope, it isn't a leak as such, the allocator knows where the memory is.
> > The problem is that is always frees remote but on allocation, it sees
> > the per-cpu list is empty and calls the page allocator again. The remote
> > lists just grow.
> >
> >> The remote list isn't drained properly?
> >
> > That is another way of looking at it. When the remote lists get to a
> > watermark, they should drain. However, it's worth pointing out if it's
> > repaired in this fashion, the performance of SLQB will suffer as it'll
> > never reuse the local list of pages and instead always get cold pages
> > from the allocator.
> 
> I worry about setting c->local_nid to the node of the allocated struct
> kmem_cache_cpu. It seems like an arbitrary policy decision that's not
> necessarily the best option and I'm not totally convinced it's correct
> when cpusets are configured. SLUB seems to do the sane thing here by
> using page allocator fallback (which respects cpusets AFAICT) and
> recycling one slab slab at a time.
> 
> Can I persuade you into sending me a patch that fixes remote list
> draining to get things working on PPC? I'd much rather wait for Nick's
> input on the allocation policy and performance.
> 

It'll be at least next week before I can revisit this again. I'm afraid
I'm going offline from tomorrow until Tuesday.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-22 18:56           ` Mel Gorman
@ 2009-09-30 14:41             ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-09-30 14:41 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Nick Piggin, Christoph Lameter, heiko.carstens, sachinp,
	linux-kernel, linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Tue, Sep 22, 2009 at 07:56:08PM +0100, Mel Gorman wrote:
> On Tue, Sep 22, 2009 at 09:54:33PM +0300, Pekka Enberg wrote:
> > Hi Mel,
> > 
> > On Tue, Sep 22, 2009 at 4:54 PM, Mel Gorman <mel@csn.ul.ie> wrote:
> > >> I don't understand how the memory leak happens from the above
> > >> description (or reading the code). page_to_nid() returns some crazy
> > >> value at free time?
> > >
> > > Nope, it isn't a leak as such, the allocator knows where the memory is.
> > > The problem is that is always frees remote but on allocation, it sees
> > > the per-cpu list is empty and calls the page allocator again. The remote
> > > lists just grow.
> > >
> > >> The remote list isn't drained properly?
> > >
> > > That is another way of looking at it. When the remote lists get to a
> > > watermark, they should drain. However, it's worth pointing out if it's
> > > repaired in this fashion, the performance of SLQB will suffer as it'll
> > > never reuse the local list of pages and instead always get cold pages
> > > from the allocator.
> > 
> > I worry about setting c->local_nid to the node of the allocated struct
> > kmem_cache_cpu. It seems like an arbitrary policy decision that's not
> > necessarily the best option and I'm not totally convinced it's correct
> > when cpusets are configured. SLUB seems to do the sane thing here by
> > using page allocator fallback (which respects cpusets AFAICT) and
> > recycling one slab slab at a time.
> > 
> > Can I persuade you into sending me a patch that fixes remote list
> > draining to get things working on PPC? I'd much rather wait for Nick's
> > input on the allocation policy and performance.
> > 
> 
> It'll be at least next week before I can revisit this again. I'm afraid
> I'm going offline from tomorrow until Tuesday.
> 

Ok, so I spent today looking at this again. The problem is not with faulty
drain logic as such. As frees always place an object on a remote list
and the allocation side is often (but not always) allocating a new page,
a significant number of objects in the free list are the only object
in a page. SLQB drains based on the number of objects on the free list,
not the number of pages. With many of the pages having only one object,
the freelists are pinning a lot more memory than expected.  For example,
a watermark to drain of 512 could be pinning 2MB of pages.
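
(For concreteness, and assuming 4K pages, which is not stated above: with
one object per page, 512 objects on the remote free lists pin roughly
512 * 4096 bytes = 2MB of slab pages that the local allocation path never
reuses.)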

The drain logic could be extended to track not only the number of objects on
the free list but also the number of pages but I really don't think that is
desirable behaviour. I'm somewhat running out of sensible ideas for dealing
with this but here is another go anyway that might be more palatable than
tracking what a "local" node is within the slab.

This boots on 2.6.32-rc1 with the latest slqb-core git tree with
Kconfig modified to allow SLQB to be set on ppc64.

==== CUT HERE ====
SLQB: Allocate from the remote lists when the local node is memoryless and has no free objects

When SLQB is freeing an object, it checks if the object belongs to a
page on the local node. If it does not, the object is freed to a
remote list. When the remote list has too many objects, the list is
drained.

On allocation, the remote list is only used if a specific node is specified
and that node is not the local node. On memoryless nodes, there is a problem
in that the specified node will often not be the local node. The impact is
that many objects on the free list are the only object in the page. This
bloats SLQB's memory requirements and causes OOM to trigger.

This patch alters the allocation path. If the allocation from local
lists fails and the local node is memoryless, an attempt will be made to
allocate from the remote lists before going to the page allocator.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
--- 
 mm/slqb.c |   30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/mm/slqb.c b/mm/slqb.c
index 4d72be2..b73e7d0 100644
--- a/mm/slqb.c
+++ b/mm/slqb.c
@@ -1513,16 +1513,30 @@ try_remote:
 	l = &c->list;
 	object = __cache_list_get_object(s, l);
 	if (unlikely(!object)) {
-		object = cache_list_get_page(s, l);
-		if (unlikely(!object)) {
-			object = __slab_alloc_page(s, gfpflags, node);
-#ifdef CONFIG_NUMA
+		int thisnode = numa_node_id();
+
+		/*
+		 * If the local node is memoryless, try remote alloc before
+		 * trying the page allocator. Otherwise, what happens is
+		 * objects are always freed to remote lists but the allocation
+		 * side always allocates a new page with only one object
+		 * used in each page
+		 */
+		if (unlikely(!node_state(thisnode, N_HIGH_MEMORY)))
+			object = __remote_slab_alloc(s, gfpflags, thisnode);
+
+		if (!object) {
+			object = cache_list_get_page(s, l);
 			if (unlikely(!object)) {
-				node = numa_node_id();
-				goto try_remote;
-			}
+				object = __slab_alloc_page(s, gfpflags, node);
+#ifdef CONFIG_NUMA
+				if (unlikely(!object)) {
+					node = numa_node_id();
+					goto try_remote;
+				}
 #endif
-			return object;
+				return object;
+			}
 		}
 	}
 	if (likely(object))

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-30 14:41             ` Mel Gorman
@ 2009-09-30 15:06               ` Christoph Lameter
  -1 siblings, 0 replies; 42+ messages in thread
From: Christoph Lameter @ 2009-09-30 15:06 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Nick Piggin, heiko.carstens, sachinp, linux-kernel,
	linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Wed, 30 Sep 2009, Mel Gorman wrote:

> Ok, so I spent today looking at this again. The problem is not with faulty
> drain logic as such. As frees always place an object on a remote list
> and the allocation side is often (but not always) allocating a new page,
> a significant number of objects in the free list are the only object
> in a page. SLQB drains based on the number of objects on the free list,
> not the number of pages. With many of the pages having only one object,
> the freelists are pinning a lot more memory than expected.  For example,
> a watermark to drain of 512 could be pinning 2MB of pages.

No good. So we are allocating new pages from somewhere, allocating a
single object from each, and putting them on the freelist where we do not
find them again. This is bad caching behavior as well.

> The drain logic could be extended to track not only the number of objects on
> the free list but also the number of pages but I really don't think that is
> desirable behaviour. I'm somewhat running out of sensible ideas for dealing
> with this but here is another go anyway that might be more palatable than
> tracking what a "local" node is within the slab.

SLUB avoids that issue by having a "current" page for a processor. It
allocates from the current page until it's exhausted. It can use fast path
logic both for allocations and frees regardless of the page's origin. The
node fallback is handled by the page allocator and that one is only
involved when a new slab page is needed.
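
[To make the "current page" idea concrete, here is a minimal sketch. It is
not SLUB's actual code; the struct and slow_path_new_page() below are
invented purely for illustration.]

/* Hypothetical per-CPU state: one "current" slab page per processor. */
struct cpu_slab {
	struct page *page;	/* page currently being allocated from */
	void *freelist;		/* chain of free objects within that page */
};

static void *alloc_fastpath(struct cpu_slab *c, gfp_t gfpflags)
{
	void *object = c->freelist;

	if (object) {
		/* Pop the next free object off the current page. */
		c->freelist = *(void **)object;
		return object;
	}
	/*
	 * Current page exhausted: get a fresh slab page. Any node
	 * fallback happens in the page allocator, not here.
	 */
	return slow_path_new_page(c, gfpflags);
}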

SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
order for free objects of the kmem_cache and then picks up from the
nearest node. Ugly but it works. SLQB would have to do something similar
since it also has the per node object bins that SLAB has.
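
[The shape of that fallback_alloc()-style scan is roughly the sketch below;
get_object_from_node() is a made-up helper standing in for a per-node list
lookup, and the real SLAB code additionally honours mempolicies and cpusets.]

/* Sketch: walk the zonelist and try each node's object lists in turn. */
static void *fallback_scan(struct kmem_cache *s, gfp_t gfpflags)
{
	struct zonelist *zonelist = node_zonelist(numa_node_id(), gfpflags);
	enum zone_type highidx = gfp_zone(gfpflags);
	struct zoneref *z;
	struct zone *zone;
	void *object = NULL;

	for_each_zone_zonelist(zone, z, zonelist, highidx) {
		object = get_object_from_node(s, zone_to_nid(zone));
		if (object)
			break;
	}
	return object;
}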

The local node for a memoryless node may not exist at all since there may
be multiple nodes at the same distance to the memoryless node. So at
minimum you would have to manage a set of local nodes. If you have the set
then you also would need to consider memory policies. During bootup you
would have to simulate the interleave mode in effect. After bootup you
would have to use the task's policy.

This all points to major NUMA issues in SLQB. This is not arch specific.
SLQB cannot handle memoryless nodes at this point.

> This patch alters the allocation path. If the allocation from local
> lists fails and the local node is memoryless, an attempt will be made to
> allocate from the remote lists before going to the page allocator.

Are the allocation attempts from the remote lists governed by memory
policies? Otherwise you may create imbalances on neighboring nodes.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-30 15:06               ` Christoph Lameter
@ 2009-09-30 22:05                 ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-09-30 22:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Nick Piggin, heiko.carstens, sachinp, linux-kernel,
	linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Wed, Sep 30, 2009 at 11:06:04AM -0400, Christoph Lameter wrote:
> On Wed, 30 Sep 2009, Mel Gorman wrote:
> 
> > Ok, so I spent today looking at this again. The problem is not with faulty
> > drain logic as such. As frees always place an object on a remote list
> > and the allocation side is often (but not always) allocating a new page,
> > a significant number of objects in the free list are the only object
> > in a page. SLQB drains based on the number of objects on the free list,
> > not the number of pages. With many of the pages having only one object,
> > the freelists are pinning a lot more memory than expected.  For example,
> > a watermark to drain of 512 could be pinning 2MB of pages.
> 
> No good. So we are allocating new pages from somewhere allocating a
> single object and putting them on the freelist where we do not find them
> again.

Yes

> This is bad caching behavior as well.
> 

Yes, I suppose it would be, as it's not using the hottest object. The
fact that it OOM storms is a bit more important than poor caching behaviour
but hey :/

> > The drain logic could be extended to track not only the number of objects on
> > the free list but also the number of pages but I really don't think that is
> > desirable behaviour. I'm somewhat running out of sensible ideas for dealing
> > with this but here is another go anyway that might be more palatable than
> > tracking what a "local" node is within the slab.
> 
> SLUB avoids that issue by having a "current" page for a processor. It
> allocates from the current page until its exhausted. It can use fast path
> logic both for allocations and frees regardless of the pages origin. The
> node fallback is handled by the page allocator and that one is only
> involved when a new slab page is needed.
> 

This is essentially the "unqueued" nature of SLUB. Its objective is "I have this
page here which I'm going to use until I can't use it no more and will depend
on the page allocator to sort my stuff out". I have to read up on SLUB
more to see if it's compatible with SLQB or not though. In particular, how
does SLUB deal with frees from pages that are not the "current" page? SLQB
does not care what page the object belongs to as long as it's node-local,
as the object is just shoved onto a LIFO for maximum hotness.
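
[For readers who have not looked at SLQB, the LIFO mentioned above amounts to
something like the sketch below; the structure and field names are invented
for illustration and are not SLQB's real ones.]

/* Push freed objects on the head of a per-CPU list: newest object first. */
struct object_list {
	void		*head;		/* most recently freed object */
	unsigned long	nr_objects;	/* drain decisions key off this count */
};

static void lifo_free(struct object_list *l, void *object)
{
	*(void **)object = l->head;	/* link the old head behind it */
	l->head = object;
	l->nr_objects++;
}

static void *lifo_alloc(struct object_list *l)
{
	void *object = l->head;		/* hottest object comes back first */

	if (object) {
		l->head = *(void **)object;
		l->nr_objects--;
	}
	return object;
}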

> SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
> order for free objects of the kmem_cache and then picks up from the
> nearest node. Ugly but it works. SLQB would have to do something similar
> since it also has the per node object bins that SLAB has.
> 

In a real sense, this is what the patch ends up doing. When it fails to
get something locally but sees that the local node is memoryless, it
will check the remote node lists in zonelist order. I think that's
reasonable behaviour but I'm biased because I just want the damn machine
to boot again. What do you think? Pekka, Nick?

> The local node for a memoryless node may not exist at all since there may
> be multiple nodes at the same distance to the memoryless node. So at
> mininum you would have to manage a set of local nodes. If you have the set
> then you also would need to consider memory policies. During bootup you
> would have to simulate the interleave mode in effect. After bootup you
> would have to use the tasks policy.
> 

I think SLQB's treatment of memory policies needs to be handled as a separate
problem. It's less than perfect at the moment; more on that below.

> This all points to major NUMA issues in SLQB. This is not arch specific.
> SLQB cannot handle memoryless nodes at this point.
> 
> > This patch alters the allocation path. If the allocation from local
> > lists fails and the local node is memoryless, an attempt will be made to
> > allocate from the remote lists before going to the page allocator.
> 
> Are the allocation attempts from the remote lists governed by memory
> policies?

They are to some extent. When selecting a node zonelist, it takes the
current memory policy into account but, at a glance, it does not appear
to obey a policy that restricts the available nodes.

> Otherwise you may create imbalances on neighboring nodes.
> 

I haven't thought about this aspect of things a whole lot to be honest.
It's not the problem at hand.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-30 22:05                 ` Mel Gorman
@ 2009-09-30 23:45                   ` Christoph Lameter
  -1 siblings, 0 replies; 42+ messages in thread
From: Christoph Lameter @ 2009-09-30 23:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Nick Piggin, heiko.carstens, sachinp, linux-kernel,
	linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Wed, 30 Sep 2009, Mel Gorman wrote:

> > SLUB avoids that issue by having a "current" page for a processor. It
> > allocates from the current page until its exhausted. It can use fast path
> > logic both for allocations and frees regardless of the pages origin. The
> > node fallback is handled by the page allocator and that one is only
> > involved when a new slab page is needed.
> >
>
> This is essentially the "unqueued" nature of SLUB. It's objective "I have this
> page here which I'm going to use until I can't use it no more and will depend
> on the page allocator to sort my stuff out". I have to read up on SLUB up
> more to see if it's compatible with SLQB or not though. In particular, how
> does SLUB deal with frees from pages that are not the "current" page? SLQB
> does not care what page the object belongs to as long as it's node-local
> as the object is just shoved onto a LIFO for maximum hotness.

Frees are done directly to the target slab page if they are not to the
current active slab page. No centralized locks. Concurrent frees from
processors on the same node to multiple other nodes (or different pages
on the same node) can occur.
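
[A rough sketch of what "free directly to the target slab page" means;
heavily simplified from mm/slub.c, with the partial-list and counter
bookkeeping omitted.]

static void free_to_page(struct page *page, void *object)
{
	/*
	 * Only the object's own slab page is locked, so frees to
	 * different pages -- on this node or on remote ones -- can
	 * proceed concurrently without a centralized lock.
	 */
	slab_lock(page);
	*(void **)object = page->freelist;	/* push onto the page's list */
	page->freelist = object;
	page->inuse--;
	slab_unlock(page);
}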

> > SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
> > order for free objects of the kmem_cache and then picks up from the
> > nearest node. Ugly but it works. SLQB would have to do something similar
> > since it also has the per node object bins that SLAB has.
> >
>
> In a real sense, this is what the patch ends up doing. When it fails to
> get something locally but sees that the local node is memoryless, it
> will check the remote node lists in zonelist order. I think that's
> reasonable behaviour but I'm biased because I just want the damn machine
> to boot again. What do you think? Pekka, Nick?

Look at fallback_alloc() in slab. You can likely copy much of it. It
considers memory policies and cpuset constraints.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-30 23:45                   ` Christoph Lameter
@ 2009-10-01 10:40                     ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-10-01 10:40 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Nick Piggin, heiko.carstens, sachinp, linux-kernel,
	linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Wed, Sep 30, 2009 at 07:45:22PM -0400, Christoph Lameter wrote:
> On Wed, 30 Sep 2009, Mel Gorman wrote:
> 
> > > SLUB avoids that issue by having a "current" page for a processor. It
> > > allocates from the current page until its exhausted. It can use fast path
> > > logic both for allocations and frees regardless of the pages origin. The
> > > node fallback is handled by the page allocator and that one is only
> > > involved when a new slab page is needed.
> > >
> >
> > This is essentially the "unqueued" nature of SLUB. It's objective "I have this
> > page here which I'm going to use until I can't use it no more and will depend
> > on the page allocator to sort my stuff out". I have to read up on SLUB up
> > more to see if it's compatible with SLQB or not though. In particular, how
> > does SLUB deal with frees from pages that are not the "current" page? SLQB
> > does not care what page the object belongs to as long as it's node-local
> > as the object is just shoved onto a LIFO for maximum hotness.
> 
> Frees are done directly to the target slab page if they are not to the
> current active slab page. No centralized locks. Concurrent frees from
> processors on the same node to multiple other nodes (or different pages
> on the same node) can occur.
> 

So as a total aside, SLQB has an advantage in that it always uses objects
in LIFO order and is more likely to be cache hot. SLUB has an advantage
when one CPU allocates and another one frees because it potentially
avoids a cache line bounce. Might be something worth bearing in mind
when/if a comparison happens later.

> > > SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
> > > order for free objects of the kmem_cache and then picks up from the
> > > nearest node. Ugly but it works. SLQB would have to do something similar
> > > since it also has the per node object bins that SLAB has.
> > >
> >
> > In a real sense, this is what the patch ends up doing. When it fails to
> > get something locally but sees that the local node is memoryless, it
> > will check the remote node lists in zonelist order. I think that's
> > reasonable behaviour but I'm biased because I just want the damn machine
> > to boot again. What do you think? Pekka, Nick?
> 
> Look at fallback_alloc() in slab. You can likely copy much of it. It
> considers memory policies and cpuset constraints.
> 

True, it looks like some of the logic should be taken from there all right. Can
the treatment of memory policies be dealt with as a separate thread though? I'd
prefer to get memoryless nodes sorted out before considering the next two
problems (per-cpu instability on ppc64 and memory policy handling in SLQB).

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-10-01 10:40                     ` Mel Gorman
@ 2009-10-01 14:32                       ` Christoph Lameter
  -1 siblings, 0 replies; 42+ messages in thread
From: Christoph Lameter @ 2009-10-01 14:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Nick Piggin, heiko.carstens, sachinp, linux-kernel,
	linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Thu, 1 Oct 2009, Mel Gorman wrote:

> > Frees are done directly to the target slab page if they are not to the
> > current active slab page. No centralized locks. Concurrent frees from
> > processors on the same node to multiple other nodes (or different pages
> > on the same node) can occur.
> >
>
> So as a total aside, SLQB has an advantage in that it always uses object
> in LIFO order and is more likely to be cache hot. SLUB has an advantage
> when one CPU allocates and another one frees because it potentially
> avoids a cache line bounce. Might be something worth bearing in mind
> when/if a comparison happens later.

SLQB may use cache-hot objects regardless of their locality. SLUB
always serves objects that have the same locality first (same page).
SLAB returns objects via the alien caches to the remote node.
So object allocations with SLUB will generate less TLB pressure since they
are localized. SLUB objects are immediately returned to the remote node;
SLAB and SLQB keep them around for reallocation or queue processing.

> > Look at fallback_alloc() in slab. You can likely copy much of it. It
> > considers memory policies and cpuset constraints.
> >
> True, it looks like some of the logic should be taken from there all right. Can
> the treatment of memory policies be dealt with as a separate thread though? I'd
> prefer to get memoryless nodes sorted out before considering the next two
> problems (per-cpu instability on ppc64 and memory policy handling in SLQB).

Separate email thread? Ok.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-10-01 15:03                         ` Mel Gorman
@ 2009-10-01 15:03                           ` Christoph Lameter
  -1 siblings, 0 replies; 42+ messages in thread
From: Christoph Lameter @ 2009-10-01 15:03 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Pekka Enberg, Nick Piggin, heiko.carstens, sachinp, linux-kernel,
	linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Thu, 1 Oct 2009, Mel Gorman wrote:

> True, it might have been improved more if SLUB knew what local hugepage it
> resided within as the kernel portion of the address space is backed by huge
> TLB entries. Note that SLQB could have an advantage here early in boot as
> the page allocator will tend to give it back pages within a single huge TLB
> entry. It loses the advantage when the system has been running for a very long
> time but it might be enough to skew benchmark results on cold-booted systems.

The page allocator serves pages aligned to huge page boundaries as far as
I can remember. You can actually use huge pages in slub if you set the max
order to 9. So a page obtained from the page allocator is always aligned
properly.
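
[If memory serves, the knob being referred to is SLUB's slub_max_order boot
parameter, e.g. booting with something like the line below to allow 2MB
(order-9) slab pages on x86-64; treat the exact value as an illustration.]

	slub_max_order=9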



^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-10-01 14:32                       ` Christoph Lameter
@ 2009-10-01 15:03                         ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-10-01 15:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Nick Piggin, heiko.carstens, sachinp, linux-kernel,
	linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Thu, Oct 01, 2009 at 10:32:54AM -0400, Christoph Lameter wrote:
> On Thu, 1 Oct 2009, Mel Gorman wrote:
> 
> > > Frees are done directly to the target slab page if they are not to the
> > > current active slab page. No centralized locks. Concurrent frees from
> > > processors on the same node to multiple other nodes (or different pages
> > > on the same node) can occur.
> > >
> >
> > So as a total aside, SLQB has an advantage in that it always uses object
> > in LIFO order and is more likely to be cache hot. SLUB has an advantage
> > when one CPU allocates and another one frees because it potentially
> > avoids a cache line bounce. Might be something worth bearing in mind
> > when/if a comparison happens later.
> 
> SLQB may use cache hot objects regardless of their locality. SLUB
> always serves objects that have the same locality first (same page).
> SLAB returns objects via the alien caches to the remote node.
> So object allocations with SLUB will generate less TLB pressure since they
> are localized.

True, it might have been improved more if SLUB knew what local hugepage it
resided within as the kernel portion of the address space is backed by huge
TLB entries. Note that SLQB could have an advantage here early in boot as
the page allocator will tend to give it back pages within a single huge TLB
entry. It loses the advantage when the system has been running for a very long
time but it might be enough to skew benchmark results on cold-booted systems.

> SLUB objects are immediately returned to the remote node.
> SLAB/SLQB keeps them around for reallocation or queue processing.
> 
> > > Look at fallback_alloc() in slab. You can likely copy much of it. It
> > > considers memory policies and cpuset constraints.
> > >
> > True, it looks like some of the logic should be taken from there all right. Can
> > the treatment of memory policies be dealt with as a separate thread though? I'd
> > prefer to get memoryless nodes sorted out before considering the next two
> > problems (per-cpu instability on ppc64 and memory policy handling in SLQB).
> 
> Separate email thread? Ok.
> 

Yes, but I'll be honest. It'll be at least two weeks before I can tackle
memory policy related issues in SLQB. It's not high on my list of
priorities. I'm more concerned with breakage on ppc64 and a patch that
forces it to be disabled. Minimally, I want this resolved before getting
distracted by another thread.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-10-01 15:03                           ` Christoph Lameter
@ 2009-10-01 15:16                             ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-10-01 15:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Nick Piggin, heiko.carstens, sachinp, linux-kernel,
	linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Thu, Oct 01, 2009 at 11:03:16AM -0400, Christoph Lameter wrote:
> On Thu, 1 Oct 2009, Mel Gorman wrote:
> 
> > True, it might have been improved more if SLUB knew what local hugepage it
> > resided within as the kernel portion of the address space is backed by huge
> > TLB entries. Note that SLQB could have an advantage here early in boot as
> > the page allocator will tend to give it back pages within a single huge TLB
> > entry. It loses the advantage when the system has been running for a very long
> > time but it might be enough to skew benchmark results on cold-booted systems.
> 
> The page allocator serves pages aligned to huge page boundaries as far as
> I can remember.

You're right, it does, particularly early in boot. It loses the advantage
when the system has been running a long time and memory is mostly full but
the same will apply to SLQB.

> You can actually use huge pages in slub if you set the max
> order to 9. So a page obtained from the page allocator is always aligned
> properly.
> 

Fair point.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-09-30 23:45                   ` Christoph Lameter
@ 2009-10-04 12:06                     ` Pekka Enberg
  -1 siblings, 0 replies; 42+ messages in thread
From: Pekka Enberg @ 2009-10-04 12:06 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Mel Gorman, Nick Piggin, heiko.carstens, sachinp, linux-kernel,
	linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Thu, Oct 1, 2009 at 2:45 AM, Christoph Lameter
<cl@linux-foundation.org> wrote:
>> This is essentially the "unqueued" nature of SLUB. It's objective "I have this
>> page here which I'm going to use until I can't use it no more and will depend
>> on the page allocator to sort my stuff out". I have to read up on SLUB up
>> more to see if it's compatible with SLQB or not though. In particular, how
>> does SLUB deal with frees from pages that are not the "current" page? SLQB
>> does not care what page the object belongs to as long as it's node-local
>> as the object is just shoved onto a LIFO for maximum hotness.
>
> Frees are done directly to the target slab page if they are not to the
> current active slab page. No centralized locks. Concurrent frees from
> processors on the same node to multiple other nodes (or different pages
> on the same node) can occur.
>
>> > SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
>> > order for free objects of the kmem_cache and then picks up from the
>> > nearest node. Ugly but it works. SLQB would have to do something similar
>> > since it also has the per node object bins that SLAB has.
>> >
>>
>> In a real sense, this is what the patch ends up doing. When it fails to
>> get something locally but sees that the local node is memoryless, it
>> will check the remote node lists in zonelist order. I think that's
>> reasonable behaviour but I'm biased because I just want the damn machine
>> to boot again. What do you think? Pekka, Nick?
>
> Look at fallback_alloc() in slab. You can likely copy much of it. It
> considers memory policies and cpuset constraints.

Sorry for the delay. I went ahead and merged Mel's patch to make
things boot on PPC. Fallback policy needs a bit more work as Christoph
says but I'd really love to have Nick's input on this.

Mel, do you have a Kconfig patch lying around somewhere to enable
SLQB on PPC and S390?

                        Pekka

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu
  2009-10-04 12:06                     ` Pekka Enberg
@ 2009-10-05  9:49                       ` Mel Gorman
  -1 siblings, 0 replies; 42+ messages in thread
From: Mel Gorman @ 2009-10-05  9:49 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Christoph Lameter, Nick Piggin, heiko.carstens, sachinp,
	linux-kernel, linux-mm, Tejun Heo, Benjamin Herrenschmidt

On Sun, Oct 04, 2009 at 03:06:45PM +0300, Pekka Enberg wrote:
> On Thu, Oct 1, 2009 at 2:45 AM, Christoph Lameter
> <cl@linux-foundation.org> wrote:
> >> This is essentially the "unqueued" nature of SLUB. It's objective "I have this
> >> page here which I'm going to use until I can't use it no more and will depend
> >> on the page allocator to sort my stuff out". I have to read up on SLUB up
> >> more to see if it's compatible with SLQB or not though. In particular, how
> >> does SLUB deal with frees from pages that are not the "current" page? SLQB
> >> does not care what page the object belongs to as long as it's node-local
> >> as the object is just shoved onto a LIFO for maximum hotness.
> >
> > Frees are done directly to the target slab page if they are not to the
> > current active slab page. No centralized locks. Concurrent frees from
> > processors on the same node to multiple other nodes (or different pages
> > on the same node) can occur.
> >
> >> > SLAB deals with it in fallback_alloc(). It scans the nodes in zonelist
> >> > order for free objects of the kmem_cache and then picks up from the
> >> > nearest node. Ugly but it works. SLQB would have to do something similar
> >> > since it also has the per node object bins that SLAB has.
> >> >
> >>
> >> In a real sense, this is what the patch ends up doing. When it fails to
> >> get something locally but sees that the local node is memoryless, it
> >> will check the remote node lists in zonelist order. I think that's
> >> reasonable behaviour but I'm biased because I just want the damn machine
> >> to boot again. What do you think? Pekka, Nick?
> >
> > Look at fallback_alloc() in slab. You can likely copy much of it. It
> > considers memory policies and cpuset constraints.
> 
> Sorry for the delay. I went ahead and merged Mel's patch to make
> things boot on PPC. Fallback policy needs a bit more work as Christoph
> says but I'd really love to have Nick's input on this.
> 
> Mel, do you have a Kconfig patch laying around somewhere to enable
> SLQB on PPC and S390?
> 

It's patch 4 of this series.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2009-10-05  9:49 UTC | newest]

Thread overview: 42+ messages
2009-09-22 12:54 [PATCH 0/3] Fix SLQB on memoryless configurations V3 Mel Gorman
2009-09-22 12:54 ` Mel Gorman
2009-09-22 12:54 ` [PATCH 1/4] slqb: Do not use DEFINE_PER_CPU for per-node data Mel Gorman
2009-09-22 12:54   ` Mel Gorman
2009-09-22 18:55   ` Pekka Enberg
2009-09-22 18:55     ` Pekka Enberg
2009-09-22 12:54 ` [PATCH 2/4] slqb: Record what node is local to a kmem_cache_cpu Mel Gorman
2009-09-22 12:54   ` Mel Gorman
2009-09-22 13:38   ` Pekka Enberg
2009-09-22 13:38     ` Pekka Enberg
2009-09-22 13:54     ` Mel Gorman
2009-09-22 13:54       ` Mel Gorman
2009-09-22 18:54       ` Pekka Enberg
2009-09-22 18:54         ` Pekka Enberg
2009-09-22 18:56         ` Mel Gorman
2009-09-22 18:56           ` Mel Gorman
2009-09-30 14:41           ` Mel Gorman
2009-09-30 14:41             ` Mel Gorman
2009-09-30 15:06             ` Christoph Lameter
2009-09-30 15:06               ` Christoph Lameter
2009-09-30 22:05               ` Mel Gorman
2009-09-30 22:05                 ` Mel Gorman
2009-09-30 23:45                 ` Christoph Lameter
2009-09-30 23:45                   ` Christoph Lameter
2009-10-01 10:40                   ` Mel Gorman
2009-10-01 10:40                     ` Mel Gorman
2009-10-01 14:32                     ` Christoph Lameter
2009-10-01 14:32                       ` Christoph Lameter
2009-10-01 15:03                       ` Mel Gorman
2009-10-01 15:03                         ` Mel Gorman
2009-10-01 15:03                         ` Christoph Lameter
2009-10-01 15:03                           ` Christoph Lameter
2009-10-01 15:16                           ` Mel Gorman
2009-10-01 15:16                             ` Mel Gorman
2009-10-04 12:06                   ` Pekka Enberg
2009-10-04 12:06                     ` Pekka Enberg
2009-10-05  9:49                     ` Mel Gorman
2009-10-05  9:49                       ` Mel Gorman
2009-09-22 12:54 ` [PATCH 3/4] slqb: Allow SLQB to be used on PPC and S390 Mel Gorman
2009-09-22 12:54   ` Mel Gorman
2009-09-22 13:21 ` [PATCH 0/3] Fix SLQB on memoryless configurations V3 Mel Gorman
2009-09-22 13:21   ` Mel Gorman
