* [PATCH -mm v2 0/3] slub: make dead caches discard free slabs immediately
@ 2015-01-28 16:22 ` Vladimir Davydov
  0 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-01-28 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

Hi,

The kmem extension of the memory cgroup is almost usable now. In fact,
only one serious issue is left: per-memcg kmem caches may pin the owner
cgroup indefinitely. This happens because a slab cache may keep empty
slab pages in its private structures to optimize performance, while we
take a css reference for each charged kmem page.

The issue is only relevant to SLUB, because SLAB periodically reaps
empty slabs. This patch set fixes the issue for SLUB; for details,
please see patch 3.

Changes in v2:
 - address Christoph's concerns regarding kmem_cache_shrink
 - fix race between put_cpu_partial reading ->cpu_partial and
   kmem_cache_shrink updating it as proposed by Joonsoo

v1: https://lkml.org/lkml/2015/1/26/317

Thanks,

Vladimir Davydov (3):
  slub: never fail to shrink cache
  slub: fix kmem_cache_shrink return value
  slub: make dead caches discard free slabs immediately

 mm/slab.c        |    4 +--
 mm/slab.h        |    2 +-
 mm/slab_common.c |   15 +++++++--
 mm/slob.c        |    2 +-
 mm/slub.c        |   94 +++++++++++++++++++++++++++++++++++-------------------
 5 files changed, 78 insertions(+), 39 deletions(-)

-- 
1.7.10.4


^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 16:22 ` Vladimir Davydov
@ 2015-01-28 16:22   ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-01-28 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

SLUB's version of __kmem_cache_shrink() not only removes empty slabs,
but also tries to rearrange the partial lists, moving the slabs with the
most objects in use to the head in order to cope with fragmentation. To
achieve that, it allocates a temporary array of lists used to sort the
slabs by the number of objects in use. If the allocation fails, the
whole procedure is aborted.

This is unacceptable for the kernel memory accounting extension of the
memory cgroup, where we want to be sure that kmem_cache_shrink()
successfully discarded empty slabs. Although the allocation failure is
extremely unlikely with the current page allocator implementation, which
retries GFP_KERNEL allocations of order <= 2 indefinitely, it is better
not to rely on that.

This patch therefore makes __kmem_cache_shrink() allocate the array on
the stack instead of calling kmalloc(), which may fail. The array size
is set to 32, because most SLUB caches store no more than 32 objects per
slab page. Slab pages with <= 32 free objects are sorted by the number
of objects in use and promoted to the head of the partial list, while
slab pages with > 32 free objects are left at the end of the list
without any ordering imposed on them.
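
For illustration, here is a minimal user-space sketch of the bucketing
rule described above (classify_slab() is a made-up helper used only for
this sketch; the actual logic lives in the __kmem_cache_shrink() hunk
below):

	#define SHRINK_PROMOTE_MAX 32

	/*
	 * Decide what to do with a partial slab that has 'objects' slots in
	 * total and 'inuse' of them allocated:
	 *   -1    - completely empty: move it to the discard list
	 *   -2    - more than SHRINK_PROMOTE_MAX free objects: leave it at
	 *           the tail without any ordering imposed on it
	 *   0..31 - index into the promote array; index 0 (a single free
	 *           object, i.e. the fullest slab) ends up closest to the
	 *           head of the partial list
	 */
	static int classify_slab(int objects, int inuse)
	{
		int free = objects - inuse;

		if (free == objects)
			return -1;
		if (free <= SHRINK_PROMOTE_MAX)
			return free - 1;
		return -2;
	}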

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slub.c |   57 ++++++++++++++++++++++++++++++---------------------------
 1 file changed, 30 insertions(+), 27 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1562955fe099..dbf9334b6a5c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3358,11 +3358,12 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+#define SHRINK_PROMOTE_MAX 32
+
 /*
- * kmem_cache_shrink removes empty slabs from the partial lists and sorts
- * the remaining slabs by the number of items in use. The slabs with the
- * most items in use come first. New allocations will then fill those up
- * and thus they can be removed from the partial lists.
+ * kmem_cache_shrink discards empty slabs and promotes the slabs filled
+ * up most to the head of the partial lists. New allocations will then
+ * fill those up and thus they can be removed from the partial lists.
  *
  * The slabs with the least items are placed last. This results in them
  * being allocated from last increasing the chance that the last objects
@@ -3375,51 +3376,56 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	struct kmem_cache_node *n;
 	struct page *page;
 	struct page *t;
-	int objects = oo_objects(s->max);
-	struct list_head *slabs_by_inuse =
-		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
+	LIST_HEAD(discard);
+	struct list_head promote[SHRINK_PROMOTE_MAX];
 	unsigned long flags;
 
-	if (!slabs_by_inuse)
-		return -ENOMEM;
+	for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
+		INIT_LIST_HEAD(promote + i);
 
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
 		if (!n->nr_partial)
 			continue;
 
-		for (i = 0; i < objects; i++)
-			INIT_LIST_HEAD(slabs_by_inuse + i);
-
 		spin_lock_irqsave(&n->list_lock, flags);
 
 		/*
-		 * Build lists indexed by the items in use in each slab.
+		 * Build lists of slabs to discard or promote.
 		 *
 		 * Note that concurrent frees may occur while we hold the
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			list_move(&page->lru, slabs_by_inuse + page->inuse);
-			if (!page->inuse)
+			int free = page->objects - page->inuse;
+
+			/* Do not reread page->inuse */
+			barrier();
+
+			/* We do not keep full slabs on the list */
+			BUG_ON(free <= 0);
+
+			if (free == page->objects) {
+				list_move(&page->lru, &discard);
 				n->nr_partial--;
+			} else if (free <= SHRINK_PROMOTE_MAX)
+				list_move(&page->lru, promote + free - 1);
 		}
 
 		/*
-		 * Rebuild the partial list with the slabs filled up most
-		 * first and the least used slabs at the end.
+		 * Promote the slabs filled up most to the head of the
+		 * partial list.
 		 */
-		for (i = objects - 1; i > 0; i--)
-			list_splice(slabs_by_inuse + i, n->partial.prev);
+		for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
+			list_splice_init(promote + i, &n->partial);
 
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
 		/* Release empty slabs */
-		list_for_each_entry_safe(page, t, slabs_by_inuse, lru)
+		list_for_each_entry_safe(page, t, &discard, lru)
 			discard_slab(s, page);
 	}
 
-	kfree(slabs_by_inuse);
 	return 0;
 }
 
@@ -4682,12 +4688,9 @@ static ssize_t shrink_show(struct kmem_cache *s, char *buf)
 static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
-	if (buf[0] == '1') {
-		int rc = kmem_cache_shrink(s);
-
-		if (rc)
-			return rc;
-	} else
+	if (buf[0] == '1')
+		kmem_cache_shrink(s);
+	else
 		return -EINVAL;
 	return length;
 }
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value
  2015-01-28 16:22 ` Vladimir Davydov
@ 2015-01-28 16:22   ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-01-28 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

kmem_cache_shrink() is supposed to return 0 if the cache has no
remaining objects and 1 otherwise, but currently it always returns 0.
Fix it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slub.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index dbf9334b6a5c..63abe52c2951 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3379,6 +3379,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	LIST_HEAD(discard);
 	struct list_head promote[SHRINK_PROMOTE_MAX];
 	unsigned long flags;
+	int ret = 0;
 
 	for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
 		INIT_LIST_HEAD(promote + i);
@@ -3419,6 +3420,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
 			list_splice_init(promote + i, &n->partial);
 
+		if (n->nr_partial || slabs_node(s, node))
+			ret = 1;
+
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
 		/* Release empty slabs */
@@ -3426,7 +3430,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 			discard_slab(s, page);
 	}
 
-	return 0;
+	return ret;
 }
 
 static int slab_mem_going_offline_callback(void *arg)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately
  2015-01-28 16:22 ` Vladimir Davydov
@ 2015-01-28 16:22   ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-01-28 16:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

To speed up further allocations, SLUB may store empty slabs on its
per-cpu/per-node partial lists instead of freeing them immediately. This
prevents per-memcg cache destruction, because kmem caches created for a
memory cgroup are only destroyed after the last page charged to the
cgroup is freed.

To fix this issue, this patch resurrects the approach first proposed in
[1]. It forbids SLUB from caching empty slabs after the memory cgroup
that the cache belongs to has been destroyed. This is achieved by
setting the kmem_cache's cpu_partial and min_partial constants to 0 and
tuning put_cpu_partial() so that it drops frozen empty slabs immediately
if cpu_partial is 0.

The runtime overhead is minimal. Of all the hot functions, we only touch
the relatively cold put_cpu_partial(): we make it call
unfreeze_partials() after freezing a slab that belongs to an offline
memory cgroup. Since slab freezing exists to avoid moving slabs from/to
a partial list on free/alloc, and there can't be any allocations from
dead caches, this shouldn't cause any overhead. We do have to disable
preemption in put_cpu_partial() to achieve that, though.
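
In condensed form, the change to put_cpu_partial() described above boils
down to the following (an annotated sketch of the control flow, not a
literal extract from the diff below):

	preempt_disable();

	/* ... the existing cmpxchg loop stashes the newly frozen slab on
	 * this CPU's partial list ... */

	if (unlikely(!s->cpu_partial)) {
		unsigned long flags;

		/*
		 * The cache is being deactivated (cpu_partial was set to 0),
		 * so do not cache anything: unfreeze this CPU's partial list,
		 * which releases the empty slabs right away.  Preemption is
		 * disabled so that we flush the same CPU the slab was just
		 * stashed on.
		 */
		local_irq_save(flags);
		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
		local_irq_restore(flags);
	}

	preempt_enable();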

The original patch was well received and even merged into the mm tree.
However, I decided to withdraw it because of changes happening in the
memcg core at that time. I had an idea of introducing per-memcg
shrinkers for kmem caches, but now that memcg has finally settled down,
I no longer see it as an option, because a SLUB shrinker would be too
costly to call, since SLUB does not keep free slabs on a separate list.
Besides, we currently do not even call per-memcg shrinkers for offline
memcgs. Overall, it would introduce much more complexity to both SLUB
and memcg than this small patch.

As for SLAB, there is no problem, because it shrinks its per-cpu/node
caches periodically. Thanks to list_lru reparenting, we no longer keep
entries for offline cgroups in per-memcg arrays (such as
memcg_cache_params->memcg_caches), so it does not matter if a per-memcg
cache is shrunk a bit later than it could be.

[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slab.c        |    4 ++--
 mm/slab.h        |    2 +-
 mm/slab_common.c |   15 +++++++++++++--
 mm/slob.c        |    2 +-
 mm/slub.c        |   31 ++++++++++++++++++++++++++-----
 5 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 7894017bc160..c4b89eaf4c96 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2382,7 +2382,7 @@ out:
 	return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep)
+int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 {
 	int ret = 0;
 	int node;
@@ -2404,7 +2404,7 @@ int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
 	int i;
 	struct kmem_cache_node *n;
-	int rc = __kmem_cache_shrink(cachep);
+	int rc = __kmem_cache_shrink(cachep, false);
 
 	if (rc)
 		return rc;
diff --git a/mm/slab.h b/mm/slab.h
index 0a56d76ac0e9..4c3ac12dd644 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -138,7 +138,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
 #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS)
 
 int __kmem_cache_shutdown(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *);
+int __kmem_cache_shrink(struct kmem_cache *, bool);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 0dd9eb4e0f87..9395842cfc6b 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -549,10 +549,13 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 {
 	int idx;
 	struct memcg_cache_array *arr;
-	struct kmem_cache *s;
+	struct kmem_cache *s, *c;
 
 	idx = memcg_cache_id(memcg);
 
+	get_online_cpus();
+	get_online_mems();
+
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
 		if (!is_root_cache(s))
@@ -560,9 +563,17 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 
 		arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
 						lockdep_is_held(&slab_mutex));
+		c = arr->entries[idx];
+		if (!c)
+			continue;
+
+		__kmem_cache_shrink(c, true);
 		arr->entries[idx] = NULL;
 	}
 	mutex_unlock(&slab_mutex);
+
+	put_online_mems();
+	put_online_cpus();
 }
 
 void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
@@ -649,7 +660,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 
 	get_online_cpus();
 	get_online_mems();
-	ret = __kmem_cache_shrink(cachep);
+	ret = __kmem_cache_shrink(cachep, false);
 	put_online_mems();
 	put_online_cpus();
 	return ret;
diff --git a/mm/slob.c b/mm/slob.c
index 96a86206a26b..94a7fede6d48 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -618,7 +618,7 @@ int __kmem_cache_shutdown(struct kmem_cache *c)
 	return 0;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d)
+int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
 {
 	return 0;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 63abe52c2951..a91789f59ab6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2007,6 +2007,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 	int pages;
 	int pobjects;
 
+	preempt_disable();
 	do {
 		pages = 0;
 		pobjects = 0;
@@ -2040,6 +2041,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 
 	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
 								!= oldpage);
+	if (unlikely(!s->cpu_partial)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+		local_irq_restore(flags);
+	}
+	preempt_enable();
 #endif
 }
 
@@ -3369,7 +3378,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 {
 	int node;
 	int i;
@@ -3381,14 +3390,26 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	unsigned long flags;
 	int ret = 0;
 
+	if (deactivate) {
+		/*
+		 * Disable empty slabs caching. Used to avoid pinning offline
+		 * memory cgroups by kmem pages that can be freed.
+		 */
+		s->cpu_partial = 0;
+		s->min_partial = 0;
+
+		/*
+		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
+		 * so we have to make sure the change is visible.
+		 */
+		kick_all_cpus_sync();
+	}
+
 	for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
 		INIT_LIST_HEAD(promote + i);
 
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
-		if (!n->nr_partial)
-			continue;
-
 		spin_lock_irqsave(&n->list_lock, flags);
 
 		/*
@@ -3439,7 +3460,7 @@ static int slab_mem_going_offline_callback(void *arg)
 
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list)
-		__kmem_cache_shrink(s);
+		__kmem_cache_shrink(s, false);
 	mutex_unlock(&slab_mutex);
 
 	return 0;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 16:22   ` Vladimir Davydov
@ 2015-01-28 16:31     ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2015-01-28 16:31 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Wed, 28 Jan 2015, Vladimir Davydov wrote:

> This patch therefore makes __kmem_cache_shrink() allocate the array on
> stack instead of calling kmalloc, which may fail. The array size is
> chosen to be equal to 32, because most SLUB caches store not more than
> 32 objects per slab page. Slab pages with <= 32 free objects are sorted
> using the array by the number of objects in use and promoted to the head
> of the partial list, while slab pages with > 32 free objects are left in
> the end of the list without any ordering imposed on them.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value
  2015-01-28 16:22   ` Vladimir Davydov
@ 2015-01-28 16:33     ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2015-01-28 16:33 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Wed, 28 Jan 2015, Vladimir Davydov wrote:

> @@ -3419,6 +3420,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
>  		for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
>  			list_splice_init(promote + i, &n->partial);
>
> +		if (n->nr_partial || slabs_node(s, node))

The total number of slabs obtained via slabs_node always contains the
number of partial ones. So no need to check n->nr_partial.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 16:22   ` Vladimir Davydov
@ 2015-01-28 16:37     ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2015-01-28 16:37 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Wed, 28 Jan 2015, Vladimir Davydov wrote:

> +			/* We do not keep full slabs on the list */
> +			BUG_ON(free <= 0);

Well, sorry, we do actually keep a number of empty slabs on the partial
lists. See the min_partial field in struct kmem_cache.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 16:37     ` Christoph Lameter
@ 2015-01-28 17:32       ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-01-28 17:32 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Wed, Jan 28, 2015 at 10:37:09AM -0600, Christoph Lameter wrote:
> On Wed, 28 Jan 2015, Vladimir Davydov wrote:
> 
> > +			/* We do not keep full slabs on the list */
> > +			BUG_ON(free <= 0);
> 
> Well sorry we do actually keep a number of empty slabs on the partial
> lists. See the min_partial field in struct kmem_cache.

It's not about empty slabs, it's about full slabs: free == 0 means the
slab is full.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value
  2015-01-28 16:33     ` Christoph Lameter
@ 2015-01-28 17:46       ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-01-28 17:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Wed, Jan 28, 2015 at 10:33:50AM -0600, Christoph Lameter wrote:
> On Wed, 28 Jan 2015, Vladimir Davydov wrote:
> 
> > @@ -3419,6 +3420,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
> >  		for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
> >  			list_splice_init(promote + i, &n->partial);
> >
> > +		if (n->nr_partial || slabs_node(s, node))
> 
> The total number of slabs obtained via slabs_node always contains the
> number of partial ones. So no need to check n->nr_partial.

Yeah, right. In addition to that, I misplaced the check - it should go
after discard_slab(), where we decrement nr_slabs. Here is the updated
patch:

From: Vladimir Davydov <vdavydov@parallels.com>
Subject: [PATCH] slub: fix kmem_cache_shrink return value

It is supposed to return 0 if the cache has no remaining objects and 1
otherwise, while currently it always returns 0. Fix it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slub.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index dbf9334b6a5c..5626588db884 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3379,6 +3379,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	LIST_HEAD(discard);
 	struct list_head promote[SHRINK_PROMOTE_MAX];
 	unsigned long flags;
+	int ret = 0;
 
 	for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
 		INIT_LIST_HEAD(promote + i);
@@ -3424,9 +3425,12 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		/* Release empty slabs */
 		list_for_each_entry_safe(page, t, &discard, lru)
 			discard_slab(s, page);
+
+		if (slabs_node(s, node))
+			ret = 1;
 	}
 
-	return 0;
+	return ret;
 }
 
 static int slab_mem_going_offline_callback(void *arg)
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 16:31     ` Christoph Lameter
@ 2015-01-28 18:29       ` Pekka Enberg
  -1 siblings, 0 replies; 52+ messages in thread
From: Pekka Enberg @ 2015-01-28 18:29 UTC (permalink / raw)
  To: Christoph Lameter, Vladimir Davydov
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On 1/28/15 6:31 PM, Christoph Lameter wrote:
> On Wed, 28 Jan 2015, Vladimir Davydov wrote:
>
>> This patch therefore makes __kmem_cache_shrink() allocate the array on
>> stack instead of calling kmalloc, which may fail. The array size is
>> chosen to be equal to 32, because most SLUB caches store not more than
>> 32 objects per slab page. Slab pages with <= 32 free objects are sorted
>> using the array by the number of objects in use and promoted to the head
>> of the partial list, while slab pages with > 32 free objects are left in
>> the end of the list without any ordering imposed on them.
> Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Pekka Enberg <penberg@kernel.org>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value
  2015-01-28 17:46       ` Vladimir Davydov
@ 2015-01-28 19:19         ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2015-01-28 19:19 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Wed, 28 Jan 2015, Vladimir Davydov wrote:

> Yeah, right. In addition to that I misplaced the check - it should go
> after discard_slab, where we decrement nr_slabs. Here goes the updated
> patch:
>
> From: Vladimir Davydov <vdavydov@parallels.com>
> Subject: [PATCH] slub: fix kmem_cache_shrink return value
>
> It is supposed to return 0 if the cache has no remaining objects and 1
> otherwise, while currently it always returns 0. Fix it.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 17:32       ` Vladimir Davydov
@ 2015-01-28 19:20         ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2015-01-28 19:20 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Wed, 28 Jan 2015, Vladimir Davydov wrote:

> On Wed, Jan 28, 2015 at 10:37:09AM -0600, Christoph Lameter wrote:
> > On Wed, 28 Jan 2015, Vladimir Davydov wrote:
> >
> > > +			/* We do not keep full slabs on the list */
> > > +			BUG_ON(free <= 0);
> >
> > Well sorry we do actually keep a number of empty slabs on the partial
> > lists. See the min_partial field in struct kmem_cache.
>
> It's not about empty slabs, it's about full slabs: free == 0 means slab
> is full.

Correct. I already acked the patch.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 16:22   ` Vladimir Davydov
@ 2015-01-28 21:57     ` Andrew Morton
  -1 siblings, 0 replies; 52+ messages in thread
From: Andrew Morton @ 2015-01-28 21:57 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Wed, 28 Jan 2015 19:22:49 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote:

> SLUB's version of __kmem_cache_shrink() not only removes empty slabs,
> but also tries to rearrange the partial lists to place slabs filled up
> most to the head to cope with fragmentation. To achieve that, it
> allocates a temporary array of lists used to sort slabs by the number of
> objects in use. If the allocation fails, the whole procedure is aborted.
> 
> This is unacceptable for the kernel memory accounting extension of the
> memory cgroup, where we want to make sure that kmem_cache_shrink()
> successfully discarded empty slabs. Although the allocation failure is
> utterly unlikely with the current page allocator implementation, which
> retries GFP_KERNEL allocations of order <= 2 infinitely, it is better
> not to rely on that.
> 
> This patch therefore makes __kmem_cache_shrink() allocate the array on
> stack instead of calling kmalloc, which may fail. The array size is
> chosen to be equal to 32, because most SLUB caches store not more than
> 32 objects per slab page. Slab pages with <= 32 free objects are sorted
> using the array by the number of objects in use and promoted to the head
> of the partial list, while slab pages with > 32 free objects are left in
> the end of the list without any ordering imposed on them.
> 
> ...
>
> @@ -3375,51 +3376,56 @@ int __kmem_cache_shrink(struct kmem_cache *s)
>  	struct kmem_cache_node *n;
>  	struct page *page;
>  	struct page *t;
> -	int objects = oo_objects(s->max);
> -	struct list_head *slabs_by_inuse =
> -		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
> +	LIST_HEAD(discard);
> +	struct list_head promote[SHRINK_PROMOTE_MAX];

512 bytes of stack.  The call paths leading to __kmem_cache_shrink()
are many and twisty.  How do we know this isn't a problem?

The logic behind choosing "32" sounds rather rubbery.  What goes wrong
if we use, say, "4"?


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 21:57     ` Andrew Morton
@ 2015-01-28 22:56       ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2015-01-28 22:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vladimir Davydov, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Wed, 28 Jan 2015, Andrew Morton wrote:

> The logic behind choosing "32" sounds rather rubbery.  What goes wrong
> if we use, say, "4"?

Like more fragmentation of slab pages.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 21:57     ` Andrew Morton
@ 2015-01-29  8:07       ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-01-29  8:07 UTC (permalink / raw)
  To: Andrew Morton, Christoph Lameter
  Cc: Joonsoo Kim, Pekka Enberg, David Rientjes, Johannes Weiner,
	Michal Hocko, linux-mm, linux-kernel

On Wed, Jan 28, 2015 at 01:57:52PM -0800, Andrew Morton wrote:
> On Wed, 28 Jan 2015 19:22:49 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote:
> > @@ -3375,51 +3376,56 @@ int __kmem_cache_shrink(struct kmem_cache *s)
> >  	struct kmem_cache_node *n;
> >  	struct page *page;
> >  	struct page *t;
> > -	int objects = oo_objects(s->max);
> > -	struct list_head *slabs_by_inuse =
> > -		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
> > +	LIST_HEAD(discard);
> > +	struct list_head promote[SHRINK_PROMOTE_MAX];
> 
> 512 bytes of stack.  The call paths leading to __kmem_cache_shrink()
> are many and twisty.  How do we know this isn't a problem?

Because currently __kmem_cache_shrink is only called from a handful of
places, none of which is supposed to have a great stack depth AFAIU,
namely:

- slab_mem_going_offline_callback - MEM_GOING_OFFLINE handler
- shrink_store - invoked upon write to /sys/kernel/slab/cache/shrink
- acpi_os_purge_cache - only called on acpi init
- memcg_deactivate_kmem_caches - called from cgroup_destroy_wq

> The logic behind choosing "32" sounds rather rubbery.  What goes wrong
> if we use, say, "4"?

We could, but kmem_cache_shrink would cope with fragmentation less
efficiently.

Come to think of it, do we really need to optimize slab placement in
kmem_cache_shrink? None of its users except shrink_store expects it -
they just want to purge the cache before destruction, that's it. Maybe
we'd better move the slab placement optimization to a separate
SLUB-private function that would be called only by shrink_store, where
we can put up with kmalloc failures? Christoph, what do you think?

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 21:57     ` Andrew Morton
@ 2015-01-29  8:32       ` Balbir Singh
  -1 siblings, 0 replies; 52+ messages in thread
From: Balbir Singh @ 2015-01-29  8:32 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vladimir Davydov, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
	David Rientjes, Johannes Weiner, Michal Hocko, linux-mm,
	linux-kernel

On Thu, Jan 29, 2015 at 3:27 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 28 Jan 2015 19:22:49 +0300 Vladimir Davydov <vdavydov@parallels.com> wrote:
>
>> SLUB's version of __kmem_cache_shrink() not only removes empty slabs,
>> but also tries to rearrange the partial lists to place slabs filled up
>> most to the head to cope with fragmentation. To achieve that, it
>> allocates a temporary array of lists used to sort slabs by the number of
>> objects in use. If the allocation fails, the whole procedure is aborted.
>>
>> This is unacceptable for the kernel memory accounting extension of the
>> memory cgroup, where we want to make sure that kmem_cache_shrink()
>> successfully discarded empty slabs. Although the allocation failure is
>> utterly unlikely with the current page allocator implementation, which
>> retries GFP_KERNEL allocations of order <= 2 infinitely, it is better
>> not to rely on that.
>>
>> This patch therefore makes __kmem_cache_shrink() allocate the array on
>> stack instead of calling kmalloc, which may fail. The array size is
>> chosen to be equal to 32, because most SLUB caches store not more than
>> 32 objects per slab page. Slab pages with <= 32 free objects are sorted
>> using the array by the number of objects in use and promoted to the head
>> of the partial list, while slab pages with > 32 free objects are left in
>> the end of the list without any ordering imposed on them.
>>
>> ...
>>
>> @@ -3375,51 +3376,56 @@ int __kmem_cache_shrink(struct kmem_cache *s)
>>       struct kmem_cache_node *n;
>>       struct page *page;
>>       struct page *t;
>> -     int objects = oo_objects(s->max);
>> -     struct list_head *slabs_by_inuse =
>> -             kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
>> +     LIST_HEAD(discard);
>> +     struct list_head promote[SHRINK_PROMOTE_MAX];
>
> 512 bytes of stack.  The call paths leading to __kmem_cache_shrink()
> are many and twisty.  How do we know this isn't a problem?
>
> The logic behind choosing "32" sounds rather rubbery.  What goes wrong
> if we use, say, "4"?

This much space on the stack may be fertile ground for kernel stack
overflow code execution :) Another perspective could be that
allocations which are not charged to a particular cgroup (from an
accounting perspective) should come from a reserved pool.

Balbir Singh.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-29  8:07       ` Vladimir Davydov
@ 2015-01-29 15:55         ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2015-01-29 15:55 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Thu, 29 Jan 2015, Vladimir Davydov wrote:

> Come to think of it, do we really need to optimize slab placement in
> kmem_cache_shrink? None of its users except shrink_store expects it -
> they just want to purge the cache before destruction, that's it. May be,
> we'd better move slab placement optimization to a separate SLUB's
> private function that would be called only by shrink_store, where we can
> put up with kmalloc failures? Christoph, what do you think?

The slabinfo tool invokes kmem_cache_shrink to optimize placement.

Run

	slabinfo -s

which can then be used to reduce the fragmentation.
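
For reference, slabinfo is a small standalone program shipped in the
kernel tree; assuming a SLUB kernel with sysfs mounted, a typical
invocation is:

	# build the tool from the kernel sources
	# (tools/vm/slabinfo.c here; the path may differ between versions)
	gcc -O2 -o slabinfo tools/vm/slabinfo.c

	# ask every cache to shrink; needs root, since it writes 1 to
	# /sys/kernel/slab/<cache>/shrink
	./slabinfo -s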


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-29 15:55         ` Christoph Lameter
@ 2015-01-29 16:17           ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-01-29 16:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Thu, Jan 29, 2015 at 09:55:56AM -0600, Christoph Lameter wrote:
> On Thu, 29 Jan 2015, Vladimir Davydov wrote:
> 
> > Come to think of it, do we really need to optimize slab placement in
> > kmem_cache_shrink? None of its users except shrink_store expects it -
> > they just want to purge the cache before destruction, that's it. May be,
> > we'd better move slab placement optimization to a separate SLUB's
> > private function that would be called only by shrink_store, where we can
> > put up with kmalloc failures? Christoph, what do you think?
> 
> The slabinfo tool invokes kmem_cache_shrink to optimize placement.
> 
> Run
> 
> 	slabinfo -s
> 
> which can then be used to reduce the fragmentation.

Yeah, but the tool just writes 1 to /sys/kernel/slab/cache/shrink, i.e.
invokes shrink_store(), and I don't propose to remove slab placement
optimization from there. What I propose is to move slab placement
optimization from kmem_cache_shrink() to shrink_store(), because other
users of kmem_cache_shrink() don't seem to need it at all - they just
want to release empty slabs. Such a change wouldn't affect the behavior
of `slabinfo -s` at all.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-29 16:17           ` Vladimir Davydov
@ 2015-01-29 16:22             ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2015-01-29 16:22 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Thu, 29 Jan 2015, Vladimir Davydov wrote:

> Yeah, but the tool just writes 1 to /sys/kernel/slab/cache/shrink, i.e.
> invokes shrink_store(), and I don't propose to remove slab placement
> optimization from there. What I propose is to move slab placement
> optimization from kmem_cache_shrink() to shrink_store(), because other
> users of kmem_cache_shrink() don't seem to need it at all - they just
> want to release empty slabs. Such a change wouldn't affect the behavior
> of `slabinfo -s` at all.

Well, we have to go through the chain of partial slabs anyway, so it's
easy to do the optimization at that point.


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-29 16:22             ` Christoph Lameter
@ 2015-01-29 18:21               ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-01-29 18:21 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Thu, Jan 29, 2015 at 10:22:16AM -0600, Christoph Lameter wrote:
> On Thu, 29 Jan 2015, Vladimir Davydov wrote:
> 
> > Yeah, but the tool just writes 1 to /sys/kernel/slab/cache/shrink, i.e.
> > invokes shrink_store(), and I don't propose to remove slab placement
> > optimization from there. What I propose is to move slab placement
> > optimization from kmem_cache_shrink() to shrink_store(), because other
> > users of kmem_cache_shrink() don't seem to need it at all - they just
> > want to release empty slabs. Such a change wouldn't affect the behavior
> > of `slabinfo -s` at all.
> 
> Well we have to go through the chain of partial slabs anyways so its easy
> to do the optimization at that point.

That's true, but we can introduce a separate function that would both
release empty slabs and optimize slab placement, like the patch below
does. It would increase the code size a bit though, so I don't insist.

diff --git a/mm/slub.c b/mm/slub.c
index 1562955fe099..2cd401d82a41 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3359,7 +3359,7 @@ void kfree(const void *x)
 EXPORT_SYMBOL(kfree);
 
 /*
- * kmem_cache_shrink removes empty slabs from the partial lists and sorts
+ * shrink_slab_cache removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
  * most items in use come first. New allocations will then fill those up
  * and thus they can be removed from the partial lists.
@@ -3368,7 +3368,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+static int shrink_slab_cache(struct kmem_cache *s)
 {
 	int node;
 	int i;
@@ -3423,6 +3423,32 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	return 0;
 }
 
+static int __kmem_cache_shrink(struct kmem_cache *s)
+{
+	int node;
+	struct kmem_cache_node *n;
+	struct page *page, *t;
+	LIST_HEAD(discard);
+	unsigned long flags;
+	int ret = 0;
+
+	flush_all(s);
+	for_each_kmem_cache_node(s, node, n) {
+		spin_lock_irqsave(&n->list_lock, flags);
+		list_for_each_entry_safe(page, t, &n->partial, lru)
+			if (!page->inuse)
+				list_move(&page->lru, &discard);
+		spin_unlock_irqrestore(&n->list_lock, flags);
+
+		list_for_each_entry_safe(page, t, &discard, lru)
+			discard_slab(s, page);
+
+		if (slabs_node(s, node))
+			ret = 1;
+	}
+	return ret;
+}
+
 static int slab_mem_going_offline_callback(void *arg)
 {
 	struct kmem_cache *s;
@@ -4683,7 +4709,7 @@ static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
 	if (buf[0] == '1') {
-		int rc = kmem_cache_shrink(s);
+		int rc = shrink_slab_cache(s);
 
 		if (rc)
 			return rc;

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-29 18:21               ` Vladimir Davydov
@ 2015-01-29 19:10                 ` Christoph Lameter
  -1 siblings, 0 replies; 52+ messages in thread
From: Christoph Lameter @ 2015-01-29 19:10 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Thu, 29 Jan 2015, Vladimir Davydov wrote:

> > Well we have to go through the chain of partial slabs anyways so its easy
> > to do the optimization at that point.
>
> That's true, but we can introduce a separate function that would both
> release empty slabs and optimize slab placement, like the patch below
> does. It would increase the code size a bit though, so I don't insist.

It would also change what slabinfo -s does.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-01-28 16:22   ` Vladimir Davydov
@ 2015-02-15  3:55     ` Sasha Levin
  -1 siblings, 0 replies; 52+ messages in thread
From: Sasha Levin @ 2015-02-15  3:55 UTC (permalink / raw)
  To: Vladimir Davydov, Andrew Morton
  Cc: Christoph Lameter, Joonsoo Kim, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

Hi Vladimir,

On 01/28/2015 11:22 AM, Vladimir Davydov wrote:
> SLUB's version of __kmem_cache_shrink() not only removes empty slabs,
> but also tries to rearrange the partial lists to place slabs filled up
> most to the head to cope with fragmentation. To achieve that, it
> allocates a temporary array of lists used to sort slabs by the number of
> objects in use. If the allocation fails, the whole procedure is aborted.
> 
> This is unacceptable for the kernel memory accounting extension of the
> memory cgroup, where we want to make sure that kmem_cache_shrink()
> successfully discarded empty slabs. Although the allocation failure is
> utterly unlikely with the current page allocator implementation, which
> retries GFP_KERNEL allocations of order <= 2 infinitely, it is better
> not to rely on that.
> 
> This patch therefore makes __kmem_cache_shrink() allocate the array on
> stack instead of calling kmalloc, which may fail. The array size is
> chosen to be equal to 32, because most SLUB caches store not more than
> 32 objects per slab page. Slab pages with <= 32 free objects are sorted
> using the array by the number of objects in use and promoted to the head
> of the partial list, while slab pages with > 32 free objects are left in
> the end of the list without any ordering imposed on them.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>

It seems that this patch causes shrink to corrupt memory:

# echo 1 > /sys/kernel/slab/kmalloc-64/shrink
#
[   60.331433] =============================================================================
[   60.333052] BUG kmalloc-64 (Not tainted): Padding overwritten. 0xffff880051018f50-0xffff880051018fff
[   60.335714] -----------------------------------------------------------------------------
[   60.335714]
[   60.338530] Disabling lock debugging due to kernel taint
[   60.340140] INFO: Slab 0xffffea0001440600 objects=32767 used=65408 fp=0x          (null) flags=0x5fffff80000000
[   60.343095] CPU: 0 PID: 8634 Comm: sh Tainted: G    B           3.19.0-next-20150213-sasha-00045-g897c679-dirty #1920
[   60.345315]  ffff88001fc0ee40 00000000542c5de6 ffff88001114f908 ffffffffabb74948
[   60.346454]  0000000000000000 ffffea0001440600 ffff88001114f9e8 ffffffffa1764b4f
[   60.347582]  0000000000000007 ffff880000000028 ffff88001114f9f8 ffff88001114f9a8
[   60.349842] Call Trace:
[   60.350243]  [<ffffffffabb74948>] dump_stack+0x4f/0x7b
[   60.350975]  [<ffffffffa1764b4f>] slab_err+0xaf/0xd0
[   60.351714]  [<ffffffffa307cce0>] ? memchr_inv+0x2c0/0x360
[   60.352592]  [<ffffffffa17671b0>] slab_pad_check+0x120/0x1c0
[   60.353418]  [<ffffffffa17673a4>] __free_slab+0x154/0x1f0
[   60.354209]  [<ffffffffa141c365>] ? trace_hardirqs_on_caller+0x475/0x610
[   60.355168]  [<ffffffffa1767478>] discard_slab+0x38/0x60
[   60.355909]  [<ffffffffa176d478>] __kmem_cache_shrink+0x258/0x300
[   60.356801]  [<ffffffffa1764770>] ? print_tracking+0x70/0x70
[   60.357621]  [<ffffffffa1764770>] ? print_tracking+0x70/0x70
[   60.358448]  [<ffffffffa16c8460>] kmem_cache_shrink+0x20/0x30
[   60.359279]  [<ffffffffa17665db>] shrink_store+0x1b/0x30
[   60.360048]  [<ffffffffa17647af>] slab_attr_store+0x3f/0xf0
[   60.360951]  [<ffffffffa1764770>] ? print_tracking+0x70/0x70
[   60.361778]  [<ffffffffa193c56a>] sysfs_kf_write+0x11a/0x180
[   60.362601]  [<ffffffffa193c450>] ? sysfs_file_ops+0x170/0x170
[   60.363447]  [<ffffffffa1939ea1>] kernfs_fop_write+0x271/0x3b0
[   60.364348]  [<ffffffffa17ba386>] vfs_write+0x186/0x5d0
[   60.365112]  [<ffffffffa17bd146>] SyS_write+0x126/0x270
[   60.365837]  [<ffffffffa17bd020>] ? SyS_read+0x270/0x270
[   60.366608]  [<ffffffffa141c365>] ? trace_hardirqs_on_caller+0x475/0x610
[   60.367580]  [<ffffffffa3091d7b>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[   60.368566]  [<ffffffffabbea3ad>] system_call_fastpath+0x16/0x1b
[   60.369435] Padding ffff880051018f50: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[   60.370750] Padding ffff880051018f60: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[   60.372077] Padding ffff880051018f70: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[   60.373450] Padding ffff880051018f80: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
[   60.374763] Padding ffff880051018f90: bb bb bb bb bb bb bb bb c8 8d 01 51 00 88 ff ff  ...........Q....
[   60.376083] Padding ffff880051018fa0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   60.377457] Padding ffff880051018fb0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   60.378768] Padding ffff880051018fc0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   60.380084] Padding ffff880051018fd0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   60.381475] Padding ffff880051018fe0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   60.382798] Padding ffff880051018ff0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   60.384121] FIX kmalloc-64: Restoring 0xffff880051018f50-0xffff880051018fff=0x5a
[...]

And basically a lot more of the above.



Thanks,
Sasha

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 1/3] slub: never fail to shrink cache
  2015-02-15  3:55     ` Sasha Levin
@ 2015-02-15  9:47       ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2015-02-15  9:47 UTC (permalink / raw)
  To: Sasha Levin
  Cc: Andrew Morton, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
	David Rientjes, Johannes Weiner, Michal Hocko, linux-mm,
	linux-kernel

Hi,

On Sat, Feb 14, 2015 at 10:55:15PM -0500, Sasha Levin wrote:
> It seems that this patch causes shrink to corrupt memory:

Yes, it does :-(

The fix can be found here: https://lkml.org/lkml/2015/2/11/347

It must have already been merged to the -mm tree:

On Thu, Feb 12, 2015 at 02:14:54PM -0800, akpm@linux-foundation.org wrote:
> 
> The patch titled
>      Subject: slub: kmem_cache_shrink: fix crash due to uninitialized discard list
> has been removed from the -mm tree.  Its filename was
>      slub-never-fail-to-shrink-cache-init-discard-list-after-freeing-slabs.patch
> 
> This patch was dropped because it was folded into slub-never-fail-to-shrink-cache.patch

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately
  2015-01-28 16:22   ` Vladimir Davydov
@ 2016-04-01  9:04     ` Peter Zijlstra
  -1 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2016-04-01  9:04 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
	David Rientjes, Johannes Weiner, Michal Hocko, linux-mm,
	linux-kernel

On Wed, Jan 28, 2015 at 07:22:51PM +0300, Vladimir Davydov wrote:
> +++ b/mm/slub.c
> @@ -2007,6 +2007,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>  	int pages;
>  	int pobjects;
>  
> +	preempt_disable();
>  	do {
>  		pages = 0;
>  		pobjects = 0;
> @@ -2040,6 +2041,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>  
>  	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
>  								!= oldpage);
> +	if (unlikely(!s->cpu_partial)) {
> +		unsigned long flags;
> +
> +		local_irq_save(flags);
> +		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> +		local_irq_restore(flags);
> +	}
> +	preempt_enable();
>  #endif
>  }
>  
> @@ -3369,7 +3378,7 @@ EXPORT_SYMBOL(kfree);
>   * being allocated from last increasing the chance that the last objects
>   * are freed in them.
>   */
> -int __kmem_cache_shrink(struct kmem_cache *s)
> +int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
>  {
>  	int node;
>  	int i;
> @@ -3381,14 +3390,26 @@ int __kmem_cache_shrink(struct kmem_cache *s)
>  	unsigned long flags;
>  	int ret = 0;
>  
> +	if (deactivate) {
> +		/*
> +		 * Disable empty slabs caching. Used to avoid pinning offline
> +		 * memory cgroups by kmem pages that can be freed.
> +		 */
> +		s->cpu_partial = 0;
> +		s->min_partial = 0;
> +
> +		/*
> +		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
> +		 * so we have to make sure the change is visible.
> +		 */
> +		kick_all_cpus_sync();
> +	}

Argh! what the heck! and without a single mention in the changelog.

Why are you spraying IPIs across the entire machine? Why isn't
synchronize_sched() good enough? That would allow you to get rid of the
local_irq_save/restore as well.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately
  2016-04-01  9:04     ` Peter Zijlstra
@ 2016-04-01 10:55       ` Vladimir Davydov
  -1 siblings, 0 replies; 52+ messages in thread
From: Vladimir Davydov @ 2016-04-01 10:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andrew Morton, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
	David Rientjes, Johannes Weiner, Michal Hocko, linux-mm,
	linux-kernel

On Fri, Apr 01, 2016 at 11:04:41AM +0200, Peter Zijlstra wrote:
> On Wed, Jan 28, 2015 at 07:22:51PM +0300, Vladimir Davydov wrote:
> > +++ b/mm/slub.c
> > @@ -2007,6 +2007,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> >  	int pages;
> >  	int pobjects;
> >  
> > +	preempt_disable();
> >  	do {
> >  		pages = 0;
> >  		pobjects = 0;
> > @@ -2040,6 +2041,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> >  
> >  	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
> >  								!= oldpage);
> > +	if (unlikely(!s->cpu_partial)) {
> > +		unsigned long flags;
> > +
> > +		local_irq_save(flags);
> > +		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> > +		local_irq_restore(flags);
> > +	}
> > +	preempt_enable();
> >  #endif
> >  }
> >  
> > @@ -3369,7 +3378,7 @@ EXPORT_SYMBOL(kfree);
> >   * being allocated from last increasing the chance that the last objects
> >   * are freed in them.
> >   */
> > -int __kmem_cache_shrink(struct kmem_cache *s)
> > +int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
> >  {
> >  	int node;
> >  	int i;
> > @@ -3381,14 +3390,26 @@ int __kmem_cache_shrink(struct kmem_cache *s)
> >  	unsigned long flags;
> >  	int ret = 0;
> >  
> > +	if (deactivate) {
> > +		/*
> > +		 * Disable empty slabs caching. Used to avoid pinning offline
> > +		 * memory cgroups by kmem pages that can be freed.
> > +		 */
> > +		s->cpu_partial = 0;
> > +		s->min_partial = 0;
> > +
> > +		/*
> > +		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
> > +		 * so we have to make sure the change is visible.
> > +		 */
> > +		kick_all_cpus_sync();
> > +	}
> 
> Argh! what the heck! and without a single mention in the changelog.

This function is only called when a memory cgroup is removed, which is
rather a rare event. I didn't think it would cause any pain. Sorry.

> 
> Why are you spraying IPIs across the entire machine? Why isn't
> synchronize_sched() good enough, that would allow you to get rid of the
> local_irq_save/restore as well.

synchronize_sched() is slower. Calling it for every per memcg kmem cache
would slow down cleanup on cgroup removal. The latter is async, so I'm
not sure if it would be a problem though. I think we can try to replace
kick_all_cpus_sync() with synchronize_sched() here.
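
Something along these lines (untested sketch, reusing the names from
the patch above):

	if (deactivate) {
		/* Disable empty slabs caching, as before. */
		s->cpu_partial = 0;
		s->min_partial = 0;

		/*
		 * put_cpu_partial() samples ->cpu_partial with preemption
		 * disabled, so once every cpu has passed through a
		 * quiescent state it is guaranteed to see the new value.
		 */
		synchronize_sched();
	}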

Regarding local_irq_save/restore - synchronize_sched() wouldn't allow us
to get rid of them, because unfreeze_partials() must be called with irqs
disabled.

Come to think of it, kick_all_cpus_sync() is used as a memory barrier
here, to make sure that once it has finished all cpus will use the new
->cpu_partial value. That makes me wonder if we could replace it with a
simple smp_mb. I mean, this_cpu_cmpxchg(), which is used by
put_cpu_partial to add a page to the per-cpu partial list, must issue a
full memory barrier (am I correct?), so we have two possibilities here:

Case 1: smp_mb is called before this_cpu_cmpxchg is called on another
cpu executing put_cpu_partial. In this case, put_cpu_partial will see
cpu_partial == 0 and hence call the second unfreeze_partials, flushing
per cpu partial list.

Case 2: smp_mb is called after this_cpu_cmpxchg. Then
__kmem_cache_shrink ->flush_all -> has_cpu_slab should see (thanks to
the barriers) that there's a slab on a per-cpu list and so flush it
(provided it hasn't already been flushed by put_cpu_partial).
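
Schematically, with the names from the patch above (just a sketch; the
whole argument of course hinges on this_cpu_cmpxchg() really implying a
full barrier):

	/* __kmem_cache_shrink(s, true), with smp_mb instead of the IPIs */
	s->cpu_partial = 0;
	s->min_partial = 0;
	smp_mb();
	flush_all(s);

	/* put_cpu_partial(), running concurrently on another cpu */
	do {
		...
	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
								!= oldpage);
	if (unlikely(!s->cpu_partial)) {
		local_irq_save(flags);
		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
		local_irq_restore(flags);
	}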

In any case, after __kmem_cache_shrink has finished, we are guaranteed
to not have any slabs on per cpu partial lists.

Does it make sense?

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately
  2016-04-01 10:55       ` Vladimir Davydov
@ 2016-04-01 11:41         ` Peter Zijlstra
  -1 siblings, 0 replies; 52+ messages in thread
From: Peter Zijlstra @ 2016-04-01 11:41 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Christoph Lameter, Joonsoo Kim, Pekka Enberg,
	David Rientjes, Johannes Weiner, Michal Hocko, linux-mm,
	linux-kernel

On Fri, Apr 01, 2016 at 01:55:40PM +0300, Vladimir Davydov wrote:
> > > +	if (deactivate) {
> > > +		/*
> > > +		 * Disable empty slabs caching. Used to avoid pinning offline
> > > +		 * memory cgroups by kmem pages that can be freed.
> > > +		 */
> > > +		s->cpu_partial = 0;
> > > +		s->min_partial = 0;
> > > +
> > > +		/*
> > > +		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
> > > +		 * so we have to make sure the change is visible.
> > > +		 */
> > > +		kick_all_cpus_sync();
> > > +	}
> > 
> > Argh! what the heck! and without a single mention in the changelog.
> 
> This function is only called when a memory cgroup is removed, which is
> rather a rare event. I didn't think it would cause any pain. Sorry.

Suppose you have a bunch of CPUs running HPC/RT code and someone causes
the admin CPUs to create/destroy a few cgroups.

> > Why are you spraying IPIs across the entire machine? Why isn't
> > synchronize_sched() good enough, that would allow you to get rid of the
> > local_irq_save/restore as well.
> 
> synchronize_sched() is slower. Calling it for every per memcg kmem cache
> would slow down cleanup on cgroup removal.

Right, but who cares? cgroup removal isn't a fast path by any standard.

> Regarding local_irq_save/restore - synchronize_sched() wouldn't allow us
> to get rid of them, because unfreeze_partials() must be called with irqs
> disabled.

OK, I figured it was because it needed to be serialized against this
kick_all_cpus_sync() IPI.

> Come to think of it, kick_all_cpus_sync() is used as a memory barrier
> here: it makes sure that, once it returns, all cpus will use the new
> ->cpu_partial value. That makes me wonder if we could replace it with
> a simple smp_mb(). I mean, this_cpu_cmpxchg(), which put_cpu_partial()
> uses to add a page to the per-cpu partial list, must issue a full
> memory barrier (am I correct?), so we have two possibilities here:

Nope, this_cpu_cmpxchg() does not imply a memory barrier.
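
IOW, with only an smp_mb() on the shrink side, the classic
store-buffering outcome would still be allowed -- roughly (a sketch,
simplified names):

	/*
	 * CPU1 (put_cpu_partial)		CPU0 (__kmem_cache_shrink)
	 * this_cpu_cmpxchg(partial, o, page)	s->cpu_partial = 0;
	 * if (!s->cpu_partial)			smp_mb();
	 *	-> may read the stale,		flush_all(s);
	 *	   non-zero value		-> may still see an empty list
	 *
	 * Unless CPU1 also has a full barrier between its store and its
	 * load, both stale observations are allowed at once, and the page
	 * stays on the dead cache's per-cpu partial list.
	 */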

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2016-04-01 11:41 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-28 16:22 [PATCH -mm v2 0/3] slub: make dead caches discard free slabs immediately Vladimir Davydov
2015-01-28 16:22 ` Vladimir Davydov
2015-01-28 16:22 ` [PATCH -mm v2 1/3] slub: never fail to shrink cache Vladimir Davydov
2015-01-28 16:22   ` Vladimir Davydov
2015-01-28 16:31   ` Christoph Lameter
2015-01-28 16:31     ` Christoph Lameter
2015-01-28 18:29     ` Pekka Enberg
2015-01-28 18:29       ` Pekka Enberg
2015-01-28 16:37   ` Christoph Lameter
2015-01-28 16:37     ` Christoph Lameter
2015-01-28 17:32     ` Vladimir Davydov
2015-01-28 17:32       ` Vladimir Davydov
2015-01-28 19:20       ` Christoph Lameter
2015-01-28 19:20         ` Christoph Lameter
2015-01-28 21:57   ` Andrew Morton
2015-01-28 21:57     ` Andrew Morton
2015-01-28 22:56     ` Christoph Lameter
2015-01-28 22:56       ` Christoph Lameter
2015-01-29  8:07     ` Vladimir Davydov
2015-01-29  8:07       ` Vladimir Davydov
2015-01-29 15:55       ` Christoph Lameter
2015-01-29 15:55         ` Christoph Lameter
2015-01-29 16:17         ` Vladimir Davydov
2015-01-29 16:17           ` Vladimir Davydov
2015-01-29 16:22           ` Christoph Lameter
2015-01-29 16:22             ` Christoph Lameter
2015-01-29 18:21             ` Vladimir Davydov
2015-01-29 18:21               ` Vladimir Davydov
2015-01-29 19:10               ` Christoph Lameter
2015-01-29 19:10                 ` Christoph Lameter
2015-01-29  8:32     ` Balbir Singh
2015-01-29  8:32       ` Balbir Singh
2015-02-15  3:55   ` Sasha Levin
2015-02-15  3:55     ` Sasha Levin
2015-02-15  9:47     ` Vladimir Davydov
2015-02-15  9:47       ` Vladimir Davydov
2015-01-28 16:22 ` [PATCH -mm v2 2/3] slub: fix kmem_cache_shrink return value Vladimir Davydov
2015-01-28 16:22   ` Vladimir Davydov
2015-01-28 16:33   ` Christoph Lameter
2015-01-28 16:33     ` Christoph Lameter
2015-01-28 17:46     ` Vladimir Davydov
2015-01-28 17:46       ` Vladimir Davydov
2015-01-28 19:19       ` Christoph Lameter
2015-01-28 19:19         ` Christoph Lameter
2015-01-28 16:22 ` [PATCH -mm v2 3/3] slub: make dead caches discard free slabs immediately Vladimir Davydov
2015-01-28 16:22   ` Vladimir Davydov
2016-04-01  9:04   ` Peter Zijlstra
2016-04-01  9:04     ` Peter Zijlstra
2016-04-01 10:55     ` Vladimir Davydov
2016-04-01 10:55       ` Vladimir Davydov
2016-04-01 11:41       ` Peter Zijlstra
2016-04-01 11:41         ` Peter Zijlstra
