* [PATCH 0/3] retry slab allocation after first failure
@ 2012-12-19 14:01 ` Glauber Costa
  0 siblings, 0 replies; 27+ messages in thread
From: Glauber Costa @ 2012-12-19 14:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, Johannes Weiner, Mel Gorman, Andrew Morton,
	Dave Shrinnker, Michal Hocko, kamezawa.hiroyu, Pekka Enberg,
	Christoph Lameter, David Rientjes

Hi everybody,

First, an introduction: I am doing this as part of a larger effort to have
good support for memcg-targeted shrinkers. The winning approach so far is to
try to reuse the generic LRU mechanism proposed by Dave Shrinnker (the
Australian xfs guy, in case you are wondering), but we don't know that for
sure yet. As usual, I am trying to flush out work that seems independent of
the main series. I would love to see these patches merged if they have merit
on their own, but I will also be happy to carry them in my personal tree if
needed.

The results in here were generated with private patches that actually shrink
per-memcg. It will take me some time to post them - but hopefully not much.
They are trivial to describe functionally, though: they shrink objects
belonging to a memcg instead of globally. Also, they only make the scenario I
am describing here more likely: they don't *create* those problems.

In a nutshell: when reclaim kicks in, we will first go through LRU and/or
page-cache pages. If we manage to free them, we can be sure at least one page
is free. After that, we scan the object shrinkers.

With the shrinkers, however, the story is different. We shrink objects that
live in pages, but that doesn't mean we will be able to free a whole page.
That heavily depends on the temporal characteristics of the workload being run.

When the kmem allocators find no space left in the existing pages, they
resort to allocating a new page. Despite the architectural differences
between the multiple allocators, they all work like this. What we can
conclude from this is that if an object allocation failed, we can be
absolutely sure that a page allocation also failed - and it is very likely
that the page allocator conducted reclaim before giving up.

The reclaim round, despite not being able to free a whole page, may very well
have been able to free objects spread around multiple pages. This means that
at this point, a new object allocation would likely succeed.
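
In code, the heuristic this series uses boils down to the sketch below. It is
condensed from the slab_should_retry() helper and its slab.c call site added
in patch 3/3; treat it as an illustration of the idea, not the literal hunks:

	/* condensed illustration of patch 3/3 - not the literal code */
	static inline bool slab_should_retry(void *obj, gfp_t flags)
	{
		if (likely(obj))	/* common case: the allocation worked */
			return false;
		/* can't retry if the caller forbids retrying and disallows waiting */
		if ((flags & __GFP_NORETRY) && !(flags & __GFP_WAIT))
			return false;
		/* without __GFP_FS, reclaim probably freed few (if any) objects */
		if (!(flags & __GFP_FS))
			return false;
		return true;
	}

	objp = __do_slab_alloc_node(cachep, flags, nodeid);
	if (slab_should_retry(objp, flags))
		objp = __do_slab_alloc_node(cachep, flags, nodeid);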

I conducted a simple experiment to easily trigger this, described as follows.

1) I have a somewhat fresh Fedora installation (it is Fedora + a linux git tree
   + some random small stuff). I then run find, sitting in a kmem-limited memcg.

2) I bisect the values of memory.kmem.limit_in_bytes, until I find an amount of
   memory that is enough to run the workload without any memory allocation
   failures three times in a row (to account for random noise).

3) I do the same, using my still unposted targeted shrinking patches. At this
   point, we are more or less simulating a physical box with global reclaim
   under very high pressure. We're only tweaking the odds in favour of failure.

4) I do the same, using the patches I am now posting here. Meaning: after
   direct reclaim has failed and a page could not be freed, we try the object
   allocation again if our flags allow it.

The result of 2) is that this whole workload fits in ~65M. Since we don't ever
reclaim, this is a very good approximation of our working set size.

In 3), because direct reclaim triggers and sometimes manages to free some pages,
this fits in ~28M.

In 4), retrying once after failure, the workload fits in ~14M.

So I believe this introduces a more resilient behavior, and is good on its own.
While I agree that this is not a *likely* scenario upstream (it will be for
containers), if we honestly believed kernel allocations would never fail, we
would not be testing their return values =)

P.S.1: I know it is the merge window and everything, so no need to rush.
P.S.2: I compiled it locally with slab, slub and slob, but I haven't passed it
through any thorough randconfig bomb yet.

Any comments appreciated.

Glauber Costa (3):
  slab: single entry-point for slab allocation
  slub: remove slab_alloc wrapper
  sl[auo]b: retry allocation once in case of failure.

 mm/slab.c | 90 ++++++++++++++++-----------------------------------------------
 mm/slab.h | 42 +++++++++++++++++++++++++++++
 mm/slob.c | 27 ++++++++++++++++---
 mm/slub.c | 34 +++++++++++++++---------
 4 files changed, 109 insertions(+), 84 deletions(-)

-- 
1.7.11.7


^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/3] slab: single entry-point for slab allocation
  2012-12-19 14:01 ` Glauber Costa
@ 2012-12-19 14:01   ` Glauber Costa
  -1 siblings, 0 replies; 27+ messages in thread
From: Glauber Costa @ 2012-12-19 14:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, Johannes Weiner, Mel Gorman, Andrew Morton,
	Dave Shrinnker, Michal Hocko, kamezawa.hiroyu, Pekka Enberg,
	Christoph Lameter, David Rientjes, Glauber Costa, Pekka Enberg

This patch modifies slab's slab_alloc so that it is node-aware at all
times. The two code paths already share most of their logic; the main
difference is how to behave when nodeid is specified as -1.

This is always the case for the non-node-aware allocation entry point,
which would unconditionally do:

        if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY)))
        {
                objp = alternate_node_alloc(cache, flags);
                if (objp)
                        goto out;
        }

meaning that it will allocate from the current node unless certain task
flags are set, in which case we try to spread the allocation around.

We can safely assume that any call to kmem_cache_alloc_node() that
passes -1 as the node would not mind seeing this behavior. So what this
patch does is add such a check to the nodeid == -1 case of
slab_alloc_node, and then convert every caller of slab_alloc to
slab_alloc_node.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 mm/slab.c | 88 +++++++++++++++------------------------------------------------
 1 file changed, 21 insertions(+), 67 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index e7667a3..a98295f 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3475,33 +3475,23 @@ done:
  * Fallback to other node is possible if __GFP_THISNODE is not set.
  */
 static __always_inline void *
-slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
-		   unsigned long caller)
+__do_slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
-	unsigned long save_flags;
 	void *ptr;
 	int slab_node = numa_mem_id();
 
-	flags &= gfp_allowed_mask;
-
-	lockdep_trace_alloc(flags);
-
-	if (slab_should_failslab(cachep, flags))
-		return NULL;
-
-	cachep = memcg_kmem_get_cache(cachep, flags);
-
-	cache_alloc_debugcheck_before(cachep, flags);
-	local_irq_save(save_flags);
-
-	if (nodeid == NUMA_NO_NODE)
+	if (nodeid == NUMA_NO_NODE) {
+		if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) {
+			ptr = alternate_node_alloc(cachep, flags);
+			if (ptr)
+				return ptr;
+		}
 		nodeid = slab_node;
+	}
 
-	if (unlikely(!cachep->nodelists[nodeid])) {
+	if (unlikely(!cachep->nodelists[nodeid]))
 		/* Node not bootstrapped yet */
-		ptr = fallback_alloc(cachep, flags);
-		goto out;
-	}
+		return fallback_alloc(cachep, flags);
 
 	if (nodeid == slab_node) {
 		/*
@@ -3512,59 +3502,23 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 		 */
 		ptr = ____cache_alloc(cachep, flags);
 		if (ptr)
-			goto out;
+			return ptr;
 	}
 	/* ___cache_alloc_node can fall back to other nodes */
-	ptr = ____cache_alloc_node(cachep, flags, nodeid);
-  out:
-	local_irq_restore(save_flags);
-	ptr = cache_alloc_debugcheck_after(cachep, flags, ptr, caller);
-	kmemleak_alloc_recursive(ptr, cachep->object_size, 1, cachep->flags,
-				 flags);
-
-	if (likely(ptr))
-		kmemcheck_slab_alloc(cachep, flags, ptr, cachep->object_size);
-
-	if (unlikely((flags & __GFP_ZERO) && ptr))
-		memset(ptr, 0, cachep->object_size);
-
-	return ptr;
-}
-
-static __always_inline void *
-__do_cache_alloc(struct kmem_cache *cache, gfp_t flags)
-{
-	void *objp;
-
-	if (unlikely(current->flags & (PF_SPREAD_SLAB | PF_MEMPOLICY))) {
-		objp = alternate_node_alloc(cache, flags);
-		if (objp)
-			goto out;
-	}
-	objp = ____cache_alloc(cache, flags);
-
-	/*
-	 * We may just have run out of memory on the local node.
-	 * ____cache_alloc_node() knows how to locate memory on other nodes
-	 */
-	if (!objp)
-		objp = ____cache_alloc_node(cache, flags, numa_mem_id());
-
-  out:
-	return objp;
+	return ____cache_alloc_node(cachep, flags, nodeid);
 }
 #else
-
 static __always_inline void *
-__do_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
+__do_slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
 {
+
 	return ____cache_alloc(cachep, flags);
 }
-
 #endif /* CONFIG_NUMA */
 
 static __always_inline void *
-slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
+slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
+		unsigned long caller)
 {
 	unsigned long save_flags;
 	void *objp;
@@ -3580,11 +3534,11 @@ slab_alloc(struct kmem_cache *cachep, gfp_t flags, unsigned long caller)
 
 	cache_alloc_debugcheck_before(cachep, flags);
 	local_irq_save(save_flags);
-	objp = __do_cache_alloc(cachep, flags);
+	objp = __do_slab_alloc_node(cachep, flags, nodeid);
 	local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
 	kmemleak_alloc_recursive(objp, cachep->object_size, 1, cachep->flags,
-				 flags);
+				flags);
 	prefetchw(objp);
 
 	if (likely(objp))
@@ -3742,7 +3696,7 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
  */
 void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 {
-	void *ret = slab_alloc(cachep, flags, _RET_IP_);
+	void *ret = slab_alloc_node(cachep, flags, NUMA_NO_NODE, _RET_IP_);
 
 	trace_kmem_cache_alloc(_RET_IP_, ret,
 			       cachep->object_size, cachep->size, flags);
@@ -3757,7 +3711,7 @@ kmem_cache_alloc_trace(struct kmem_cache *cachep, gfp_t flags, size_t size)
 {
 	void *ret;
 
-	ret = slab_alloc(cachep, flags, _RET_IP_);
+	ret = slab_alloc_node(cachep, flags, NUMA_NO_NODE, _RET_IP_);
 
 	trace_kmalloc(_RET_IP_, ret,
 		      size, cachep->size, flags);
@@ -3850,7 +3804,7 @@ static __always_inline void *__do_kmalloc(size_t size, gfp_t flags,
 	cachep = __find_general_cachep(size, flags);
 	if (unlikely(ZERO_OR_NULL_PTR(cachep)))
 		return cachep;
-	ret = slab_alloc(cachep, flags, caller);
+	ret = slab_alloc_node(cachep, flags, NUMA_NO_NODE, caller);
 
 	trace_kmalloc(caller, ret,
 		      size, cachep->size, flags);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/3] slub: remove slab_alloc wrapper
  2012-12-19 14:01 ` Glauber Costa
@ 2012-12-19 14:01   ` Glauber Costa
  -1 siblings, 0 replies; 27+ messages in thread
From: Glauber Costa @ 2012-12-19 14:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, Johannes Weiner, Mel Gorman, Andrew Morton,
	Dave Shrinnker, Michal Hocko, kamezawa.hiroyu, Pekka Enberg,
	Christoph Lameter, David Rientjes, Glauber Costa, Pekka Enberg

slab_alloc is such a simple and unconditional wrapper around
slab_alloc_node that we can get rid of it for simplicity, patching
the callers directly.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Andrew Morton <akpm@linux-foundation.org>
---
 mm/slub.c | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ba2ca53..b72569c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2389,15 +2389,9 @@ redo:
 	return object;
 }
 
-static __always_inline void *slab_alloc(struct kmem_cache *s,
-		gfp_t gfpflags, unsigned long addr)
-{
-	return slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr);
-}
-
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
 {
-	void *ret = slab_alloc(s, gfpflags, _RET_IP_);
+	void *ret = slab_alloc_node(s, gfpflags, NUMA_NO_NODE, _RET_IP_);
 
 	trace_kmem_cache_alloc(_RET_IP_, ret, s->object_size, s->size, gfpflags);
 
@@ -2408,7 +2402,7 @@ EXPORT_SYMBOL(kmem_cache_alloc);
 #ifdef CONFIG_TRACING
 void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t gfpflags, size_t size)
 {
-	void *ret = slab_alloc(s, gfpflags, _RET_IP_);
+	void *ret = slab_alloc_node(s, gfpflags, NUMA_NO_NODE, _RET_IP_);
 	trace_kmalloc(_RET_IP_, ret, size, s->size, gfpflags);
 	return ret;
 }
@@ -3288,7 +3282,7 @@ void *__kmalloc(size_t size, gfp_t flags)
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
-	ret = slab_alloc(s, flags, _RET_IP_);
+	ret = slab_alloc_node(s, flags, NUMA_NO_NODE, _RET_IP_);
 
 	trace_kmalloc(_RET_IP_, ret, size, s->size, flags);
 
@@ -3938,7 +3932,7 @@ void *__kmalloc_track_caller(size_t size, gfp_t gfpflags, unsigned long caller)
 	if (unlikely(ZERO_OR_NULL_PTR(s)))
 		return s;
 
-	ret = slab_alloc(s, gfpflags, caller);
+	ret = slab_alloc_node(s, gfpflags, NUMA_NO_NODE, caller);
 
 	/* Honor the call site pointer we received. */
 	trace_kmalloc(caller, ret, size, s->size, gfpflags);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/3] sl[auo]b: retry allocation once in case of failure.
  2012-12-19 14:01 ` Glauber Costa
@ 2012-12-19 14:01   ` Glauber Costa
  -1 siblings, 0 replies; 27+ messages in thread
From: Glauber Costa @ 2012-12-19 14:01 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-fsdevel, Johannes Weiner, Mel Gorman, Andrew Morton,
	Dave Shrinnker, Michal Hocko, kamezawa.hiroyu, Pekka Enberg,
	Christoph Lameter, David Rientjes, Glauber Costa, Pekka Enberg

When we are out of space in the caches, we will try to allocate a new
page.  If we still fail, the page allocator will try to free pages
through direct reclaim. This means that if an object allocation failed
we can be sure that no new pages could be given to us, even though
direct reclaim was likely invoked.

However, direct reclaim will also try to shrink objects from registered
shrinkers. They won't necessarily free a full page, but if our cache
happens to be one with a shrinker, this may very well open up the space
we need. So we retry the allocation in this case.

We can't know for sure if this happened. So the best we can do is try to
derive from our allocation flags how likely it is for direct reclaim to
have been called, and retry if we conclude that this is highly likely
(__GFP_FS set, and either __GFP_WAIT set or __GFP_NORETRY not set).

The common case is for the allocation to succeed. So we carefully insert
a likely branch for that case.

Signed-off-by: Glauber Costa <glommer@parallels.com>
CC: Christoph Lameter <cl@linux.com>
CC: David Rientjes <rientjes@google.com>
CC: Pekka Enberg <penberg@cs.helsinki.fi>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Mel Gorman <mgorman@suse.de>
---
 mm/slab.c |  2 ++
 mm/slab.h | 42 ++++++++++++++++++++++++++++++++++++++++++
 mm/slob.c | 27 +++++++++++++++++++++++----
 mm/slub.c | 26 ++++++++++++++++++++------
 4 files changed, 87 insertions(+), 10 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index a98295f..7e82f99 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3535,6 +3535,8 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
 	cache_alloc_debugcheck_before(cachep, flags);
 	local_irq_save(save_flags);
 	objp = __do_slab_alloc_node(cachep, flags, nodeid);
+	if (slab_should_retry(objp, flags))
+		objp = __do_slab_alloc_node(cachep, flags, nodeid);
 	local_irq_restore(save_flags);
 	objp = cache_alloc_debugcheck_after(cachep, flags, objp, caller);
 	kmemleak_alloc_recursive(objp, cachep->object_size, 1, cachep->flags,
diff --git a/mm/slab.h b/mm/slab.h
index 34a98d6..03d1590 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -104,6 +104,48 @@ void slabinfo_show_stats(struct seq_file *m, struct kmem_cache *s);
 ssize_t slabinfo_write(struct file *file, const char __user *buffer,
 		       size_t count, loff_t *ppos);
 
+/*
+ * When we are out of space in the caches, we will try to allocate a new page.
+ * If we still fail, the page allocator will try to free pages through direct
+ * reclaim. Which means that if an object allocation failed we can be sure that
+ * no new pages could be given to us.
+ *
+ * However, direct reclaim will also try to shrink objects from registered
+ * shrinkers. They won't necessarily free a full page, but if our cache happens
+ * to be one with a shrinker, this may very well open up the space we need. So
+ * we retry the allocation in this case.
+ *
+ * We can't know for sure if this is the case. So the best we can do is try
+ * to derive from our allocation flags how likely it is for direct reclaim to
+ * have been called.
+ *
+ * Most of the time the allocation will succeed, and this will be just a branch
+ * with a very high hit ratio.
+ */
+static inline bool slab_should_retry(void *obj, gfp_t flags)
+{
+	if (likely(obj))
+		return false;
+
+	/*
+	 * those are obvious. We can't retry if the flags either explicitly
+	 * prohibit, or disallow waiting.
+	 */
+	if ((flags & __GFP_NORETRY) && !(flags & __GFP_WAIT))
+		return false;
+
+	/*
+	 * If this is not a __GFP_FS allocation, we are unlikely to have
+	 * reclaimed many objects - if at all. We may have succeeded in
+	 * allocating a new page, in which case the object allocation would
+	 * have succeeded, but most of the shrinkable objects would still be
+	 * in their caches. So retrying is likely futile.
+	 */
+	if (!(flags & __GFP_FS))
+		return false;
+	return true;
+}
+
 #ifdef CONFIG_MEMCG_KMEM
 static inline bool is_root_cache(struct kmem_cache *s)
 {
diff --git a/mm/slob.c b/mm/slob.c
index a99fdf7..f00127b 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -424,7 +424,7 @@ out:
  */
 
 static __always_inline void *
-__do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
+___do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
 {
 	unsigned int *m;
 	int align = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
@@ -462,6 +462,16 @@ __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
 	return ret;
 }
 
+
+static __always_inline void *
+__do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
+{
+	void *obj = ___do_kmalloc_node(size, gfp, node, caller);
+        if (slab_should_retry(obj, gfp))
+		obj = ___do_kmalloc_node(size, gfp, node, caller);
+	return obj;
+}
+
 void *__kmalloc_node(size_t size, gfp_t gfp, int node)
 {
 	return __do_kmalloc_node(size, gfp, node, _RET_IP_);
@@ -534,7 +544,9 @@ int __kmem_cache_create(struct kmem_cache *c, unsigned long flags)
 	return 0;
 }
 
-void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
+static __always_inline void *
+slab_alloc_node(struct kmem_cache *c, gfp_t flags, int node,
+		unsigned long caller)
 {
 	void *b;
 
@@ -544,12 +556,12 @@ void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
 
 	if (c->size < PAGE_SIZE) {
 		b = slob_alloc(c->size, flags, c->align, node);
-		trace_kmem_cache_alloc_node(_RET_IP_, b, c->object_size,
+		trace_kmem_cache_alloc_node(caller, b, c->object_size,
 					    SLOB_UNITS(c->size) * SLOB_UNIT,
 					    flags, node);
 	} else {
 		b = slob_new_pages(flags, get_order(c->size), node);
-		trace_kmem_cache_alloc_node(_RET_IP_, b, c->object_size,
+		trace_kmem_cache_alloc_node(caller, b, c->object_size,
 					    PAGE_SIZE << get_order(c->size),
 					    flags, node);
 	}
@@ -562,6 +574,13 @@ void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node);
 
+void *kmem_cache_alloc_node(struct kmem_cache *c, gfp_t flags, int node)
+{
+	void *obj = slab_alloc_node(c, flags, node, _RET_IP_);
+        if (slab_should_retry(obj, flags))
+		obj = slab_alloc_node(c, flags, node, _RET_IP_);
+	return obj;
+}
 static void __kmem_cache_free(void *b, int size)
 {
 	if (size < PAGE_SIZE)
diff --git a/mm/slub.c b/mm/slub.c
index b72569c..580dfa8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2318,18 +2318,15 @@ new_slab:
  *
  * Otherwise we can simply pick the next object from the lockless free list.
  */
-static __always_inline void *slab_alloc_node(struct kmem_cache *s,
-		gfp_t gfpflags, int node, unsigned long addr)
+static __always_inline void *
+do_slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags, int node,
+		   unsigned long addr)
 {
 	void **object;
 	struct kmem_cache_cpu *c;
 	struct page *page;
 	unsigned long tid;
 
-	if (slab_pre_alloc_hook(s, gfpflags))
-		return NULL;
-
-	s = memcg_kmem_get_cache(s, gfpflags);
 redo:
 
 	/*
@@ -2389,6 +2386,23 @@ redo:
 	return object;
 }
 
+static __always_inline void *
+slab_alloc_node(struct kmem_cache *s, gfp_t gfpflags,
+		int node, unsigned long addr)
+{
+	void *obj;
+
+	if (slab_pre_alloc_hook(s, gfpflags))
+		return NULL;
+
+	s = memcg_kmem_get_cache(s, gfpflags);
+
+	obj = do_slab_alloc_node(s, gfpflags, node, addr);
+	if (slab_should_retry(obj, gfpflags))
+		obj = do_slab_alloc_node(s, gfpflags, node, addr);
+	return obj;
+}
+
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
 {
 	void *ret = slab_alloc_node(s, gfpflags, NUMA_NO_NODE, _RET_IP_);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] sl[auo]b: retry allocation once in case of failure.
  2012-12-19 14:01   ` Glauber Costa
@ 2012-12-26  2:16     ` Kamezawa Hiroyuki
  -1 siblings, 0 replies; 27+ messages in thread
From: Kamezawa Hiroyuki @ 2012-12-26  2:16 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, Pekka Enberg,
	Christoph Lameter, David Rientjes, Pekka Enberg

(2012/12/19 23:01), Glauber Costa wrote:
> When we are out of space in the caches, we will try to allocate a new
> page.  If we still fail, the page allocator will try to free pages
> through direct reclaim. This means that if an object allocation failed
> we can be sure that no new pages could be given to us, even though
> direct reclaim was likely invoked.
> 
> However, direct reclaim will also try to shrink objects from registered
> shrinkers. They won't necessarily free a full page, but if our cache
> happens to be one with a shrinker, this may very well open up the space
> we need. So we retry the allocation in this case.
> 
> We can't know for sure if this happened. So the best we can do is try to
> derive from our allocation flags how likely it is for direct reclaim to
> have been called, and retry if we conclude that this is highly likely
> (__GFP_FS set, and either __GFP_WAIT set or __GFP_NORETRY not set).
> 
> The common case is for the allocation to succeed. So we carefully insert
> a likely branch for that case.
> 
> Signed-off-by: Glauber Costa <glommer@parallels.com>
> CC: Christoph Lameter <cl@linux.com>
> CC: David Rientjes <rientjes@google.com>
> CC: Pekka Enberg <penberg@cs.helsinki.fi>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: Mel Gorman <mgorman@suse.de>
> ---
>   mm/slab.c |  2 ++
>   mm/slab.h | 42 ++++++++++++++++++++++++++++++++++++++++++
>   mm/slob.c | 27 +++++++++++++++++++++++----
>   mm/slub.c | 26 ++++++++++++++++++++------
>   4 files changed, 87 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index a98295f..7e82f99 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3535,6 +3535,8 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
>   	cache_alloc_debugcheck_before(cachep, flags);
>   	local_irq_save(save_flags);
>   	objp = __do_slab_alloc_node(cachep, flags, nodeid);
> +	if (slab_should_retry(objp, flags))
> +		objp = __do_slab_alloc_node(cachep, flags, nodeid);

3 questions. 

1. Why can't we do the retry in memcg's code (or kmem/memcg code) rather than in slab.c?
2. Should it retry even if the memory allocator returns a NULL page?
3. What's the relationship with the oom-killer? Will the first __do_slab_alloc_node()
   not invoke the oom-killer and just return NULL?

Thanks,
-Kame





^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] sl[auo]b: retry allocation once in case of failure.
  2012-12-26  2:16     ` Kamezawa Hiroyuki
@ 2012-12-26  7:55       ` Glauber Costa
  -1 siblings, 0 replies; 27+ messages in thread
From: Glauber Costa @ 2012-12-26  7:55 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, Pekka Enberg,
	Christoph Lameter, David Rientjes, Pekka Enberg

Hello Kame,
>> diff --git a/mm/slab.c b/mm/slab.c
>> index a98295f..7e82f99 100644
>> --- a/mm/slab.c
>> +++ b/mm/slab.c
>> @@ -3535,6 +3535,8 @@ slab_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid,
>>   	cache_alloc_debugcheck_before(cachep, flags);
>>   	local_irq_save(save_flags);
>>   	objp = __do_slab_alloc_node(cachep, flags, nodeid);
>> +	if (slab_should_retry(objp, flags))
>> +		objp = __do_slab_alloc_node(cachep, flags, nodeid);
> 
> 3 questions. 
> 
> 1. Why can't we do the retry in memcg's code (or kmem/memcg code) rather than in slab.c?
Due to two main reasons:
 a. This is not memcg/kmemcg specific. I used kmemcg to make the
container very small and therefore make the failure more likely, but it
can also happen on non-constrained systems.

 b. memcg hooks into the page allocation. This patchset deals with cases
in which we really can't allocate a new page, but in which we are
confident that we could allocate a new *object* should we retry.

> 2. Should it retry even if the memory allocator returns a NULL page?

Yes, this is the whole point of this exercise. When we get back a NULL
page, we are almost certain to have called reclaim. Reclaim will call
shrink_slab(), which may free objects within a page. So if we retry, we
may now find space within a page, even if we can't get a whole free page.

> 3. What's the relationship with the oom-killer? Will the first __do_slab_alloc_node()
>    not invoke the oom-killer and just return NULL?
> 
Good question. In all my testing, I've never seen the oom killer be
invoked for failed slab allocations, with either slab or slub. What I
usually see is just the allocator giving up and flooding the log with
failure messages. It seemed logical to me, so I never really asked
myself why the oom killer wasn't invoked. (It usually is invoked right
after, if I fire up a user memory hog.) Perhaps someone can shed some
light on the subject?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] slub: remove slab_alloc wrapper
  2012-12-19 14:01   ` Glauber Costa
  (?)
@ 2013-01-02 16:03   ` Christoph Lameter
  -1 siblings, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2013-01-02 16:03 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, kamezawa.hiroyu,
	Pekka Enberg, David Rientjes, Pekka Enberg

On Wed, 19 Dec 2012, Glauber Costa wrote:

> slab_alloc is such a simple and unconditional wrapper around
> slab_alloc_node that we can get rid of it for simplicity, patching
> the callers directly.

Acked-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] sl[auo]b: retry allocation once in case of failure.
  2012-12-19 14:01   ` Glauber Costa
  (?)
  (?)
@ 2013-01-02 16:10   ` Christoph Lameter
  2013-01-09 10:25       ` Glauber Costa
  -1 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2013-01-02 16:10 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, kamezawa.hiroyu,
	Pekka Enberg, David Rientjes, Pekka Enberg

On Wed, 19 Dec 2012, Glauber Costa wrote:

> When we are out of space in the caches, we will try to allocate a new
> page.  If we still fail, the page allocator will try to free pages
> through direct reclaim. This means that if an object allocation failed
> we can be sure that no new pages could be given to us, even though
> direct reclaim was likely invoked.

Well, this hits the hot allocation path with lots of additional checks
that also require touching more cachelines.

How much impact will this have?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] retry slab allocation after first failure
  2012-12-19 14:01 ` Glauber Costa
@ 2013-01-02 16:32   ` Christoph Lameter
  -1 siblings, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2013-01-02 16:32 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, kamezawa.hiroyu,
	Pekka Enberg, David Rientjes

On Wed, 19 Dec 2012, Glauber Costa wrote:

> The reclaim round, despite not being able to free a whole page, may very well
> have been able to free objects spread around multiple pages. Which means that
> at this point, a new object allocation would likely succeed.

I think this is reasonable but the approach is too intrusive on the hot
paths. Page allocation from the slab allocators happens outside of the hot
allocation and free paths.

slub has a call to slab_out_of_memory() in __slab_alloc(), which may be
a reasonable point to insert the logic.

slab has the fallback_alloc() function to deal with short-on-memory
situations. Insert the additional logic there.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] retry slab allocation after first failure
  2012-12-19 14:01 ` Glauber Costa
@ 2013-01-02 16:33   ` Christoph Lameter
  -1 siblings, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2013-01-02 16:33 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, kamezawa.hiroyu,
	Pekka Enberg, David Rientjes

On Wed, 19 Dec 2012, Glauber Costa wrote:
>Dave Shrinnker <david@fromorbit.com>,

Hehehe. Hopefully the email address at least works.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] retry slab allocation after first failure
  2013-01-02 16:32   ` Christoph Lameter
@ 2013-01-09 10:22     ` Glauber Costa
  -1 siblings, 0 replies; 27+ messages in thread
From: Glauber Costa @ 2013-01-09 10:22 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, kamezawa.hiroyu,
	Pekka Enberg, David Rientjes

Hi,

On 01/02/2013 08:32 PM, Christoph Lameter wrote:
> On Wed, 19 Dec 2012, Glauber Costa wrote:
> 
>> The reclaim round, despite not being able to free a whole page, may very well
>> have been able to free objects spread around multiple pages. Which means that
>> at this point, a new object allocation would likely succeed.
> 
> I think this is reasonable but the approach is too intrusive on the hot
> paths. Page allocation from the slab allocators happens outside of the hot
> allocation and free paths.
> 
> slub has call to slab_out_of_memory in __slab_alloc which may be a
> reasonable point to insert the logic.
> 
> slab has this whole fallback_alloc function to deal with short on memory
> situations. Insert the additional logic there.
> 
I disagree and agree with you at the same time.

I disagree with you, because I don't see the trade-off as being so
simple, for two main reasons.

First, the logic of retrying is largely independent of the allocator,
and doing it at this level of abstraction allows us to move it to common
code as soon as we can. All the allocation decisions can be kept
internal to the underlying allocator, and we act only at the very high
level.

Also, at first sight it looks a bit ugly, in the sense that being inside
the allocator we will necessarily recurse (since the goal is to retry
the whole allocation). We want to keep the retries limited, so we'll
need to pass a flag around - if we fail again, we will reach the same
retry code and have to know that we need to stop. Can it be done? Sure.
But it looks unnecessarily complicated at best, especially given that I
don't expect the cost of it to be big:

I can measure it as much as you want, but I can pretty much guarantee
you that the cost is near zero. The hot path, which is when the
allocation succeeds, will load the address, which is guaranteed to be in
the cache, since it is the value we are returning (or very soon will be,
because someone almost always reads the return value anyway). It will
then issue a strongly hinted branch, which, unlike normal branches,
should not have any pipeline ill effect at all if we get it right (which
we will).

So we are talking about one hot instruction here if the allocation
succeeds. And if it fails, it is no longer a hot path.
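
In code, the hinted branch I have in mind is just the usual unlikely()
pattern (a sketch with illustrative names, not the actual patch):

/*
 * Sketch only: the retry helper is noinline, so the inlined fast path
 * gains nothing more than a compare, a predicted not-taken jump, and a
 * call that is normally skipped.
 */
static __always_inline void *object_alloc(struct kmem_cache *s, gfp_t flags)
{
    void *obj = do_object_alloc(s, flags);    /* stays inline */

    if (unlikely(!obj))
        obj = object_alloc_retry(s, flags);   /* noinline slow path */
    return obj;
}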

Now, I agree with you, because I now see I missed one detail: those
functions are all marked __always_inline, which means that we would
double the code size at every allocation site, in every single case. So
this is bad. But it is also very easy to fix: we can have a noinline
function that calls the allocator function, and call that function on
the retry path instead - with proper comments.

So we are talking here about:

hot path: one extra instruction executed, a hinted branch on data that
is already in cache. Code size increases by that instruction plus a call
instruction that will only be executed in the slow path.

slow path:

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] sl[auo]b: retry allocation once in case of failure.
  2013-01-02 16:10   ` Christoph Lameter
@ 2013-01-09 10:25       ` Glauber Costa
  0 siblings, 0 replies; 27+ messages in thread
From: Glauber Costa @ 2013-01-09 10:25 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, kamezawa.hiroyu,
	Pekka Enberg, David Rientjes, Pekka Enberg

On 01/02/2013 08:10 PM, Christoph Lameter wrote:
> On Wed, 19 Dec 2012, Glauber Costa wrote:
> 
>> When we are out of space in the caches, we will try to allocate a new
>> page.  If we still fail, the page allocator will try to free pages
>> through direct reclaim. Which means that if an object allocation failed
>> we can be sure that no new pages could be given to us, even though
>> direct reclaim was likely invoked.
> 
> Well this hits the hot allocation path with lots of additional checks
> that also require the touching of more cachelines.
> 
> How much impact will this have?
> 

speed: the hot path is one fast branch (with an almost always correct
hint); the slow path is a full retry.

As for the flag checks, they are already in the slow path. The hot path
is the allocation succeeding: that is the first condition to be tested,
and we bail out as soon as we see it is true. The flag checks can be
moved into a noinline helper function together with the retry code
itself.

size: the branch instruction plus a call instruction, if done right (I
got this last part wrong in this set).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] retry slab allocation after first failure
  2013-01-09 10:22     ` Glauber Costa
@ 2013-01-09 15:38       ` Christoph Lameter
  -1 siblings, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2013-01-09 15:38 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, kamezawa.hiroyu,
	Pekka Enberg, David Rientjes

On Wed, 9 Jan 2013, Glauber Costa wrote:

> I disagree with you, because I don't see the trade-off as being so
> simple, for two main reasons.

The problem is that many people make this kind of tradeoff every year,
and so the additional logic accumulates in the hot paths, which leads to
gradual decay of allocator performance. It is possible to put this into
the slow path for this round, so let's do it.

> First, the logic of retrying is largely independent of the allocator,
> and doing it in this level of abstraction allow us to move it to common
> code as soon as we can. All the allocation decisions can be kept
> internal to the underlying allocator, and we act only on the very high
> level.

Right now you have separate patches for the allocators. There is also a
different way to abstract this: both allocators have special functions
to deal with OOM conditions. These could be moved into slab_common.c
too, to get unified reporting of OOM conditions and unified fallback
handling.
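
Something along these lines, say (purely illustrative: no such helper
exists in slab_common.c today, and the name and signature are made up):

/*
 * Illustrative only: a common OOM/fallback helper that both slab and
 * slub could call from their slow paths.
 */
bool slab_common_oom_fallback(struct kmem_cache *s, gfp_t flags,
                              bool *retried)
{
    slab_common_report_oom(s, flags);    /* unified warning, made up */

    if (*retried || (flags & __GFP_NORETRY))
        return false;        /* really give up */

    *retried = true;
    return true;             /* caller loops back and retries once */
}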

> I can measure it as much as you want, but I can pretty much guarantee
> you that the cost is near zero. The hot path, which is, when the

I hear the same argument every time around.

> Now, I agree with you, because I now see I missed one detail: those
> functions are all marked as __always_inline. Which means that we will
> double the code size in every allocation, for every single case. So this
> is bad. But it is also very easy to fix: We can have a noinline function
> that calls the allocator function, and we call that function instead -
> with proper comments.

There is a reason these functions are inline: the inlining allows the
code generated for special cases (like !NUMA) to be optimized.
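
For example (illustrative only, not the actual allocator code):

/*
 * Illustrative only: with the wrapper forced inline, a constant node
 * argument (or a !NUMA build) lets the compiler fold the branch away
 * entirely, so the common case costs nothing at runtime.
 */
static __always_inline void *example_alloc(struct kmem_cache *s,
                                           gfp_t flags, int node)
{
    if (node == NUMA_NO_NODE)    /* folded away when node is constant */
        return example_alloc_local(s, flags);
    return example_alloc_node(s, flags, node);
}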

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] retry slab allocation after first failure
  2013-01-09 15:38       ` Christoph Lameter
@ 2013-01-09 15:59         ` Glauber Costa
  -1 siblings, 0 replies; 27+ messages in thread
From: Glauber Costa @ 2013-01-09 15:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, kamezawa.hiroyu,
	Pekka Enberg, David Rientjes

On 01/09/2013 07:38 PM, Christoph Lameter wrote:
> On Wed, 9 Jan 2013, Glauber Costa wrote:
> 
>> I disagree with you, because I don't see the trade-off as being so
>> simple, for two main reasons.
> 
> The problem is that many people do this kind of tradeoff every year and so
> the additional logic accumulates in the hot paths which leads to gradual
> decay of allocator performance. It is possible to put
> this into the slow path for this round and so lets do it.
> 

>> First, the logic of retrying is largely independent of the allocator,
>> and doing it in this level of abstraction allow us to move it to common
>> code as soon as we can. All the allocation decisions can be kept
>> internal to the underlying allocator, and we act only on the very high
>> level.
> 
> Right now you have separate patches for the allocators. There is also a
> way to abstract this in a different way: Both allocators have special
> functions to deal with OOM conditions. This could be put into
> slab_common.c too to have unified reporting of OOM conditionns and unified
> fallback handling.

I will certainly look into that. But I still fear that regardless of
what we do here, the logic is still "retry the whole thing again". We
already know there is no free page, and without inspecting the internals
of the allocator it is hard to know where to start looking - therefore,
we need to retry from the start.

If it fails again, we will recurse into the same oom state. This means
we need to keep track of state, at the very least which retry we are in.
If I can keep it local, fine, I will argue no further. But if this state
needs to spread throughout the other paths, I think we have a lot more
to lose.

But as I said, I haven't tried this; I only have a rudimentary mental
image of it. I'll look into it and let you know.

> 
>> I can measure it as much as you want, but I can pretty much guarantee
>> you that the cost is near zero. The hot path, which is, when the
> 
> I hear the same argument every time around.
> 
>> Now, I agree with you, because I now see I missed one detail: those
>> functions are all marked as __always_inline. Which means that we will
>> double the code size in every allocation, for every single case. So this
>> is bad. But it is also very easy to fix: We can have a noinline function
>> that calls the allocator function, and we call that function instead -
>> with proper comments.
> 
> There is a reason these functions are inline because the inlining allows
> code generation for special cases (like !NUMA) to have optimized code.
> 
Of course they are, and I am not proposing to make the allocator
functions non-inline.

kmem_cache_alloc() stays inline. It calls do_cache_alloc(), which also
stays inline. Only on retry do we call non_inline_cache_alloc(), which
is basically just:

static noinline void *non_inline_cache_alloc(...)
{
    /* off the hot path: retry the whole (still inline) allocation once */
    if (should_we_retry)
        return do_cache_alloc(...);
    return NULL;
}

This way the hot path is still inlined and the retry path is not. The
impact on the inlined path is just a couple of instructions - precisely
three: cmp, jne, call.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] retry slab allocation after first failure
  2013-01-09 15:59         ` Glauber Costa
  (?)
@ 2013-01-09 18:59         ` Christoph Lameter
  -1 siblings, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2013-01-09 18:59 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-mm, linux-fsdevel, Johannes Weiner, Mel Gorman,
	Andrew Morton, Dave Shrinnker, Michal Hocko, kamezawa.hiroyu,
	Pekka Enberg, David Rientjes

On Wed, 9 Jan 2013, Glauber Costa wrote:

> I will certainly look into that. But I still fear that regardless of
> what we do here, the logic is still "retry the whole thing again". We
> already know there is no free page, and without inspecting the internals
> of the allocator it is hard to know where to start looking - therefore,
> we need to retry from the start.

The slow path functions in the allocator can perform the retry of
"the whole thing".

slab's fallback_alloc() does precisely that. What you need to do there
is just go back to the retry: label if the attempt to grow the slab
fails.

slub's __slab_alloc() can use a similar approach by looping back to the
redo: label.

Either one is trivial to implement.

> If it fails again, we will recurse to the same oom state. This means we
> need to keep track of the state, at least, in which retry we are. If I
> can keep it local, fine, I will argue no further. But if this state
> needs to spread throughout the other paths, I think we have a lot more
> to lose.

Adding another state variable is not that intrusive in fallback_alloc or
__slab_alloc.

Alternatively, we could extract the functionality that checks the
current queues from the allocator and call it twice from the slow paths.
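
In rough outline, the loop-back would look like this (a sketch of the
shape only, with made-up helper names - not the real __slab_alloc() or
fallback_alloc() code):

static void *slow_path_alloc_sketch(struct kmem_cache *s, gfp_t flags)
{
    bool retried = false;
    void *obj;

retry:
    obj = try_existing_slabs(s, flags);        /* made-up helper */
    if (obj)
        return obj;

    if (allocate_new_slab_page(s, flags))      /* made-up helper */
        goto retry;

    /*
     * The failed page allocation already ran direct reclaim, which may
     * have freed objects in slabs we already hold, so one more pass
     * over the existing slabs is worth trying.
     */
    if (!retried && !(flags & __GFP_NORETRY)) {
        retried = true;
        goto retry;
    }

    return NULL;
}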

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2013-01-09 18:59 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-19 14:01 [PATCH 0/3] retry slab allocation after first failure Glauber Costa
2012-12-19 14:01 ` Glauber Costa
2012-12-19 14:01 ` [PATCH 1/3] slab: single entry-point for slab allocation Glauber Costa
2012-12-19 14:01   ` Glauber Costa
2012-12-19 14:01 ` [PATCH 2/3] slub: remove slab_alloc wrapper Glauber Costa
2012-12-19 14:01   ` Glauber Costa
2013-01-02 16:03   ` Christoph Lameter
2012-12-19 14:01 ` [PATCH 3/3] sl[auo]b: retry allocation once in case of failure Glauber Costa
2012-12-19 14:01   ` Glauber Costa
2012-12-26  2:16   ` Kamezawa Hiroyuki
2012-12-26  2:16     ` Kamezawa Hiroyuki
2012-12-26  7:55     ` Glauber Costa
2012-12-26  7:55       ` Glauber Costa
2013-01-02 16:10   ` Christoph Lameter
2013-01-09 10:25     ` Glauber Costa
2013-01-09 10:25       ` Glauber Costa
2013-01-02 16:32 ` [PATCH 0/3] retry slab allocation after first failure Christoph Lameter
2013-01-02 16:32   ` Christoph Lameter
2013-01-09 10:22   ` Glauber Costa
2013-01-09 10:22     ` Glauber Costa
2013-01-09 15:38     ` Christoph Lameter
2013-01-09 15:38       ` Christoph Lameter
2013-01-09 15:59       ` Glauber Costa
2013-01-09 15:59         ` Glauber Costa
2013-01-09 18:59         ` Christoph Lameter
2013-01-02 16:33 ` Christoph Lameter
2013-01-02 16:33   ` Christoph Lameter
