* [PATCH -mm 0/3] slub: make dead caches discard free slabs immediately
@ 2015-01-26 12:55 Vladimir Davydov
  2015-01-26 12:55 ` [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails Vladimir Davydov
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-26 12:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

Hi,

The kmem extension of the memory cgroup is almost usable now. There is,
in fact, only one serious issue left: per-memcg kmem caches may pin the
owner cgroup for an indefinitely long time. This happens because a slab
cache may keep empty slab pages in its private structures to optimize
performance, while we take a css reference for each charged kmem page.

The issue is only relevant to SLUB, because SLAB periodically reaps
empty slabs. This patch set fixes this issue for SLUB. For details,
please see patch 3.

Thanks,

Vladimir Davydov (3):
  slub: don't fail kmem_cache_shrink if slab placement optimization
    fails
  slab: zap kmem_cache_shrink return value
  slub: make dead caches discard free slabs immediately

 include/linux/slab.h |    2 +-
 mm/slab.c            |    9 +++++++--
 mm/slab.h            |    2 +-
 mm/slab_common.c     |   21 +++++++++++++-------
 mm/slob.c            |    3 +--
 mm/slub.c            |   53 +++++++++++++++++++++++++++++++++++---------------
 6 files changed, 61 insertions(+), 29 deletions(-)

-- 
1.7.10.4



* [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails
  2015-01-26 12:55 [PATCH -mm 0/3] slub: make dead caches discard free slabs immediately Vladimir Davydov
@ 2015-01-26 12:55 ` Vladimir Davydov
  2015-01-26 15:48   ` Christoph Lameter
  2015-01-26 12:55 ` [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value Vladimir Davydov
  2015-01-26 12:55 ` [PATCH -mm 3/3] slub: make dead caches discard free slabs immediately Vladimir Davydov
  2 siblings, 1 reply; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-26 12:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

SLUB's kmem_cache_shrink not only removes empty slabs from the cache,
but also sorts slabs by the number of objects in-use to cope with
fragmentation. To achieve that, it tries to allocate a temporary array.
If the allocation fails, the whole procedure is aborted.

This is unacceptable for kmemcg, where we want to be sure that all empty
slabs are removed from the cache on memcg offline, so let's just skip
the slab placement optimization step if the allocation fails, but still
get rid of empty slabs.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |   18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 5ed1a73e2ec8..770bea3ed445 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3376,12 +3376,19 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	struct page *page;
 	struct page *t;
 	int objects = oo_objects(s->max);
+	struct list_head empty_slabs;
 	struct list_head *slabs_by_inuse =
 		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
 	unsigned long flags;
 
-	if (!slabs_by_inuse)
-		return -ENOMEM;
+	if (!slabs_by_inuse) {
+		/*
+		 * Do not abort if we failed to allocate a temporary array.
+		 * Just skip the slab placement optimization then.
+		 */
+		slabs_by_inuse = &empty_slabs;
+		objects = 1;
+	}
 
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
@@ -3400,7 +3407,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			list_move(&page->lru, slabs_by_inuse + page->inuse);
+			if (page->inuse < objects)
+				list_move(&page->lru,
+					  slabs_by_inuse + page->inuse);
 			if (!page->inuse)
 				n->nr_partial--;
 		}
@@ -3419,7 +3428,8 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 			discard_slab(s, page);
 	}
 
-	kfree(slabs_by_inuse);
+	if (slabs_by_inuse != &empty_slabs)
+		kfree(slabs_by_inuse);
 	return 0;
 }
 
-- 
1.7.10.4



* [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value
  2015-01-26 12:55 [PATCH -mm 0/3] slub: make dead caches discard free slabs immediately Vladimir Davydov
  2015-01-26 12:55 ` [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails Vladimir Davydov
@ 2015-01-26 12:55 ` Vladimir Davydov
  2015-01-26 15:49   ` Christoph Lameter
  2015-01-26 12:55 ` [PATCH -mm 3/3] slub: make dead caches discard free slabs immediately Vladimir Davydov
  2 siblings, 1 reply; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-26 12:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

The kmem_cache_shrink() return value is inconsistent: for SLAB it
returns 0 iff the cache is empty, while for SLUB and SLOB it always
returns 0. So let's zap it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/slab.h |    2 +-
 mm/slab.c            |    9 +++++++--
 mm/slab.h            |    2 +-
 mm/slab_common.c     |    8 ++------
 mm/slob.c            |    3 +--
 mm/slub.c            |   12 ++++--------
 6 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index ed2ffaab59ea..18430ed916b1 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -116,7 +116,7 @@ struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
 			unsigned long,
 			void (*)(void *));
 void kmem_cache_destroy(struct kmem_cache *);
-int kmem_cache_shrink(struct kmem_cache *);
+void kmem_cache_shrink(struct kmem_cache *);
 
 void memcg_create_kmem_cache(struct mem_cgroup *, struct kmem_cache *);
 void memcg_deactivate_kmem_caches(struct mem_cgroup *);
diff --git a/mm/slab.c b/mm/slab.c
index 7894017bc160..279c44d6d8e1 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2382,7 +2382,7 @@ out:
 	return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep)
+static int __cache_shrink(struct kmem_cache *cachep)
 {
 	int ret = 0;
 	int node;
@@ -2400,11 +2400,16 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 	return (ret ? 1 : 0);
 }
 
+void __kmem_cache_shrink(struct kmem_cache *cachep)
+{
+	__cache_shrink(cachep);
+}
+
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
 	int i;
 	struct kmem_cache_node *n;
-	int rc = __kmem_cache_shrink(cachep);
+	int rc = __cache_shrink(cachep);
 
 	if (rc)
 		return rc;
diff --git a/mm/slab.h b/mm/slab.h
index 0a56d76ac0e9..c036e520d2cf 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -138,7 +138,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
 #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS)
 
 int __kmem_cache_shutdown(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *);
+void __kmem_cache_shrink(struct kmem_cache *);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 0dd9eb4e0f87..6803639fdff0 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -641,18 +641,14 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  * @cachep: The cache to shrink.
  *
  * Releases as many slabs as possible for a cache.
- * To help debugging, a zero exit status indicates all slabs were released.
  */
-int kmem_cache_shrink(struct kmem_cache *cachep)
+void kmem_cache_shrink(struct kmem_cache *cachep)
 {
-	int ret;
-
 	get_online_cpus();
 	get_online_mems();
-	ret = __kmem_cache_shrink(cachep);
+	__kmem_cache_shrink(cachep);
 	put_online_mems();
 	put_online_cpus();
-	return ret;
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
diff --git a/mm/slob.c b/mm/slob.c
index 96a86206a26b..043a14b6ccbe 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -618,9 +618,8 @@ int __kmem_cache_shutdown(struct kmem_cache *c)
 	return 0;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d)
+void __kmem_cache_shrink(struct kmem_cache *c)
 {
-	return 0;
 }
 
 struct kmem_cache kmem_cache_boot = {
diff --git a/mm/slub.c b/mm/slub.c
index 770bea3ed445..c09d93dde40e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3368,7 +3368,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+void __kmem_cache_shrink(struct kmem_cache *s)
 {
 	int node;
 	int i;
@@ -3430,7 +3430,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 
 	if (slabs_by_inuse != &empty_slabs)
 		kfree(slabs_by_inuse);
-	return 0;
 }
 
 static int slab_mem_going_offline_callback(void *arg)
@@ -4696,12 +4695,9 @@ static ssize_t shrink_show(struct kmem_cache *s, char *buf)
 static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
-	if (buf[0] == '1') {
-		int rc = kmem_cache_shrink(s);
-
-		if (rc)
-			return rc;
-	} else
+	if (buf[0] == '1')
+		kmem_cache_shrink(s);
+	else
 		return -EINVAL;
 	return length;
 }
-- 
1.7.10.4



* [PATCH -mm 3/3] slub: make dead caches discard free slabs immediately
  2015-01-26 12:55 [PATCH -mm 0/3] slub: make dead caches discard free slabs immediately Vladimir Davydov
  2015-01-26 12:55 ` [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails Vladimir Davydov
  2015-01-26 12:55 ` [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value Vladimir Davydov
@ 2015-01-26 12:55 ` Vladimir Davydov
  2015-01-27  8:00   ` Joonsoo Kim
  2 siblings, 1 reply; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-26 12:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

To speed up further allocations, SLUB may store empty slabs on per-cpu
and per-node partial lists instead of freeing them immediately. This
prevents per-memcg cache destruction, because kmem caches created for a
memory cgroup are only destroyed after the last page charged to the
cgroup is freed.

To fix this issue, this patch resurrects the approach first proposed in
[1]. It forbids SLUB from caching empty slabs after the memory cgroup
that the cache belongs to has been destroyed. This is achieved by setting
the kmem_cache's cpu_partial and min_partial constants to 0 and tuning
put_cpu_partial() so that it drops frozen empty slabs immediately if
cpu_partial is 0.

The runtime overhead is minimal. Of all the hot functions, we only touch
the relatively cold put_cpu_partial(): we make it call
unfreeze_partials() after freezing a slab that belongs to an offline
memory cgroup. Since slab freezing exists to avoid moving slabs from/to
a partial list on free/alloc, and there can't be allocations from dead
caches, it shouldn't cause any overhead. We do have to disable
preemption for put_cpu_partial() to achieve that, though.

The original patch was received well and even merged into the mm tree.
However, I decided to withdraw it due to changes happening to the memcg
core at that time. I had an idea of introducing per-memcg shrinkers for
kmem caches, but now, as memcg has finally settled down, I do not see it
as an option, because a SLUB shrinker would be too costly to call, since
SLUB does not keep free slabs on a separate list. Besides, we currently
do not even call per-memcg shrinkers for offline memcgs. Overall, it
would introduce much more complexity to both SLUB and memcg than this
small patch.

As for SLAB, there is no problem with it, because it shrinks its
per-cpu/per-node caches periodically. Thanks to list_lru reparenting, we
no longer keep entries for offline cgroups in per-memcg arrays (such as
memcg_cache_params->memcg_caches), so we do not have to worry if a
per-memcg cache gets shrunk a bit later than it could be.

[1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slab.c        |    2 +-
 mm/slab.h        |    2 +-
 mm/slab_common.c |   15 +++++++++++++--
 mm/slob.c        |    2 +-
 mm/slub.c        |   25 ++++++++++++++++++++-----
 5 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 279c44d6d8e1..f0514df07b85 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2400,7 +2400,7 @@ static int __cache_shrink(struct kmem_cache *cachep)
 	return (ret ? 1 : 0);
 }
 
-void __kmem_cache_shrink(struct kmem_cache *cachep)
+void __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 {
 	__cache_shrink(cachep);
 }
diff --git a/mm/slab.h b/mm/slab.h
index c036e520d2cf..041260197984 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -138,7 +138,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
 #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS)
 
 int __kmem_cache_shutdown(struct kmem_cache *);
-void __kmem_cache_shrink(struct kmem_cache *);
+void __kmem_cache_shrink(struct kmem_cache *, bool);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 6803639fdff0..472ab7fcffd4 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -549,10 +549,13 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 {
 	int idx;
 	struct memcg_cache_array *arr;
-	struct kmem_cache *s;
+	struct kmem_cache *s, *c;
 
 	idx = memcg_cache_id(memcg);
 
+	get_online_cpus();
+	get_online_mems();
+
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
 		if (!is_root_cache(s))
@@ -560,9 +563,17 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 
 		arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
 						lockdep_is_held(&slab_mutex));
+		c = arr->entries[idx];
+		if (!c)
+			continue;
+
+		__kmem_cache_shrink(c, true);
 		arr->entries[idx] = NULL;
 	}
 	mutex_unlock(&slab_mutex);
+
+	put_online_mems();
+	put_online_cpus();
 }
 
 void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
@@ -646,7 +657,7 @@ void kmem_cache_shrink(struct kmem_cache *cachep)
 {
 	get_online_cpus();
 	get_online_mems();
-	__kmem_cache_shrink(cachep);
+	__kmem_cache_shrink(cachep, false);
 	put_online_mems();
 	put_online_cpus();
 }
diff --git a/mm/slob.c b/mm/slob.c
index 043a14b6ccbe..e63ff9d926dc 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -618,7 +618,7 @@ int __kmem_cache_shutdown(struct kmem_cache *c)
 	return 0;
 }
 
-void __kmem_cache_shrink(struct kmem_cache *c)
+void __kmem_cache_shrink(struct kmem_cache *c, bool deactivate)
 {
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index c09d93dde40e..6f57824af019 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2007,6 +2007,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 	int pages;
 	int pobjects;
 
+	preempt_disable();
 	do {
 		pages = 0;
 		pobjects = 0;
@@ -2040,6 +2041,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 
 	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
 								!= oldpage);
+	if (unlikely(!s->cpu_partial)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+		local_irq_restore(flags);
+	}
+	preempt_enable();
 #endif
 }
 
@@ -3368,7 +3377,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-void __kmem_cache_shrink(struct kmem_cache *s)
+void __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 {
 	int node;
 	int i;
@@ -3381,6 +3390,15 @@ void __kmem_cache_shrink(struct kmem_cache *s)
 		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
 	unsigned long flags;
 
+	if (deactivate) {
+		/*
+		 * Disable empty slabs caching. Used to avoid pinning offline
+		 * memory cgroups by freeable kmem pages.
+		 */
+		s->cpu_partial = 0;
+		s->min_partial = 0;
+	}
+
 	if (!slabs_by_inuse) {
 		/*
 		 * Do not abort if we failed to allocate a temporary array.
@@ -3392,9 +3410,6 @@ void __kmem_cache_shrink(struct kmem_cache *s)
 
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
-		if (!n->nr_partial)
-			continue;
-
 		for (i = 0; i < objects; i++)
 			INIT_LIST_HEAD(slabs_by_inuse + i);
 
@@ -3438,7 +3453,7 @@ static int slab_mem_going_offline_callback(void *arg)
 
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list)
-		__kmem_cache_shrink(s);
+		__kmem_cache_shrink(s, false);
 	mutex_unlock(&slab_mutex);
 
 	return 0;
-- 
1.7.10.4



* Re: [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails
  2015-01-26 12:55 ` [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails Vladimir Davydov
@ 2015-01-26 15:48   ` Christoph Lameter
  2015-01-26 17:01     ` Vladimir Davydov
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2015-01-26 15:48 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, 26 Jan 2015, Vladimir Davydov wrote:

> SLUB's kmem_cache_shrink not only removes empty slabs from the cache,
> but also sorts slabs by the number of objects in-use to cope with
> fragmentation. To achieve that, it tries to allocate a temporary array.
> If it fails, it will abort the whole procedure.

I do not think it's worth optimizing this. If we cannot allocate even a
small object then the system is in an extremely bad state anyway.

> @@ -3400,7 +3407,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
>  		 * list_lock. page->inuse here is the upper limit.
>  		 */
>  		list_for_each_entry_safe(page, t, &n->partial, lru) {
> -			list_move(&page->lru, slabs_by_inuse + page->inuse);
> +			if (page->inuse < objects)
> +				list_move(&page->lru,
> +					  slabs_by_inuse + page->inuse);
>  			if (!page->inuse)
>  				n->nr_partial--;
>  		}

The condition is always true. A page that has page->inuse == objects
would not be on the partial list.



* Re: [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value
  2015-01-26 12:55 ` [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value Vladimir Davydov
@ 2015-01-26 15:49   ` Christoph Lameter
  2015-01-26 17:04     ` Vladimir Davydov
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2015-01-26 15:49 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, 26 Jan 2015, Vladimir Davydov wrote:

> @@ -2400,11 +2400,16 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
>  	return (ret ? 1 : 0);
>  }
>
> +void __kmem_cache_shrink(struct kmem_cache *cachep)
> +{
> +	__cache_shrink(cachep);
> +}
> +

Why do we need this wrapper? Rename __cache_shrink to __kmem_cache_shrink
instead?


* Re: [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails
  2015-01-26 15:48   ` Christoph Lameter
@ 2015-01-26 17:01     ` Vladimir Davydov
  2015-01-26 18:24       ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-26 17:01 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

Hi Christoph,

On Mon, Jan 26, 2015 at 09:48:00AM -0600, Christoph Lameter wrote:
> On Mon, 26 Jan 2015, Vladimir Davydov wrote:
> 
> > SLUB's kmem_cache_shrink not only removes empty slabs from the cache,
> > but also sorts slabs by the number of objects in-use to cope with
> > fragmentation. To achieve that, it tries to allocate a temporary array.
> > If it fails, it will abort the whole procedure.
> 
> I do not think its worth optimizing this. If we cannot allocate even a
> small object then the system is in an extremely bad state anyways.

Hmm, I've just checked my /proc/slabinfo and seen that I have at most 512
objects per slab, so the temporary array will be at most 2 pages. So
you're right - this kmalloc will never fail on my system, simply because
we never fail GFP_KERNEL allocations of order < 3. However, theoretically
we can have as many as MAX_OBJS_PER_PAGE=32767 objects per slab, which
would result in a huge allocation.
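(Back-of-the-envelope, assuming 16-byte list heads on 64-bit: 512 * 16 =
8K, i.e. an order-1 allocation, while the worst case is 32767 * 16 =
~512K, i.e. an order-7 allocation, which can certainly fail.)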

Anyway, I think that silently relying on the allocator never failing
small allocations is kind of unreliable. What if this behavior changes
one day? So I'd prefer to either make kmem_cache_shrink fall back to
using a variable on the stack in case of kmalloc failure, like this patch
does, or place an explicit BUG_ON after it. The latter looks dangerous to
me, because, as I mentioned above, I'm not sure that we always have less
than 2048 objects per slab.

> 
> > @@ -3400,7 +3407,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
> >  		 * list_lock. page->inuse here is the upper limit.
> >  		 */
> >  		list_for_each_entry_safe(page, t, &n->partial, lru) {
> > -			list_move(&page->lru, slabs_by_inuse + page->inuse);
> > +			if (page->inuse < objects)
> > +				list_move(&page->lru,
> > +					  slabs_by_inuse + page->inuse);
> >  			if (!page->inuse)
> >  				n->nr_partial--;
> >  		}
> 
> The condition is always true. A page that has page->inuse == objects
> would not be on the partial list.
> 

This is in case we failed to allocate the slabs_by_inuse array. We only
have a list for empty slabs then (on stack).

Thanks,
Vladimir


* Re: [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value
  2015-01-26 15:49   ` Christoph Lameter
@ 2015-01-26 17:04     ` Vladimir Davydov
  2015-01-26 18:26       ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-26 17:04 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, Jan 26, 2015 at 09:49:47AM -0600, Christoph Lameter wrote:
> On Mon, 26 Jan 2015, Vladimir Davydov wrote:
> 
> > @@ -2400,11 +2400,16 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
> >  	return (ret ? 1 : 0);
> >  }
> >
> > +void __kmem_cache_shrink(struct kmem_cache *cachep)
> > +{
> > +	__cache_shrink(cachep);
> > +}
> > +
> 
> Why do we need this wrapper? Rename __cache_shrink to __kmem_cache_shrink
> instead?
> 

__cache_shrink() is used not only in __kmem_cache_shrink(), but also in
SLAB's __kmem_cache_shutdown(), where we do need its return value to
check if the cache is empty.

Thanks,
Vladimir


* Re: [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails
  2015-01-26 17:01     ` Vladimir Davydov
@ 2015-01-26 18:24       ` Christoph Lameter
  2015-01-26 19:36         ` Vladimir Davydov
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2015-01-26 18:24 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, 26 Jan 2015, Vladimir Davydov wrote:

> Anyways, I think that silently relying on the fact that the allocator
> never fails small allocations is kind of unreliable. What if this

We are not doing that though. If the allocation fails we do nothing.

> > > +			if (page->inuse < objects)
> > > +				list_move(&page->lru,
> > > +					  slabs_by_inuse + page->inuse);
> > >  			if (!page->inuse)
> > >  				n->nr_partial--;
> > >  		}
> >
> > The condition is always true. A page that has page->inuse == objects
> > would not be on the partial list.
> >
>
> This is in case we failed to allocate the slabs_by_inuse array. We only
> have a list for empty slabs then (on stack).

Ok, in that case objects == 1. If you want to do this, maybe do it in a
more general way?

You could allocate an array on the stack to deal with the common cases. I
believe an array of 32 list heads would be fine to allocate and would
cover most of the slab caches on the system. That would eliminate most of
the allocations in kmem_cache_shrink.
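Something along these lines - just a rough sketch, the array size and the
names are made up, and the single-bucket fallback on kmalloc failure is
what your patch does anyway:

	struct list_head small_by_inuse[32];
	struct list_head *slabs_by_inuse = small_by_inuse;
	int objects = oo_objects(s->max);

	if (objects > ARRAY_SIZE(small_by_inuse)) {
		slabs_by_inuse = kmalloc(sizeof(struct list_head) * objects,
					 GFP_KERNEL);
		if (!slabs_by_inuse) {
			/* Fall back to tracking only empty slabs. */
			slabs_by_inuse = small_by_inuse;
			objects = 1;
		}
	}

	/* ... the existing shrink logic ... */

	if (slabs_by_inuse != small_by_inuse)
		kfree(slabs_by_inuse);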




* Re: [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value
  2015-01-26 17:04     ` Vladimir Davydov
@ 2015-01-26 18:26       ` Christoph Lameter
  2015-01-26 19:48         ` Vladimir Davydov
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2015-01-26 18:26 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, 26 Jan 2015, Vladimir Davydov wrote:

> __cache_shrink() is used not only in __kmem_cache_shrink(), but also in
> SLAB's __kmem_cache_shutdown(), where we do need its return value to
> check if the cache is empty.

It could be useful to know if a slab is empty. So maybe leave
kmem_cache_shrink the way it is and instead fix up slub to return the
proper value?



* Re: [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails
  2015-01-26 18:24       ` Christoph Lameter
@ 2015-01-26 19:36         ` Vladimir Davydov
  2015-01-26 19:53           ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-26 19:36 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, Jan 26, 2015 at 12:24:49PM -0600, Christoph Lameter wrote:
> On Mon, 26 Jan 2015, Vladimir Davydov wrote:
> 
> > Anyways, I think that silently relying on the fact that the allocator
> > never fails small allocations is kind of unreliable. What if this
> 
> We are not doing that though. If the allocation fails we do nothing.

Yeah, that's correct, but memcg/kmem wants it to always free empty slabs
(see patch 3 for details), so I'm trying to be thorough and eliminate
any possibility of failure, because a failure (if it ever happened)
would result in a permanent memory leak (pinned mem_cgroup + its
kmem_caches).

> 
> > > > +			if (page->inuse < objects)
> > > > +				list_move(&page->lru,
> > > > +					  slabs_by_inuse + page->inuse);
> > > >  			if (!page->inuse)
> > > >  				n->nr_partial--;
> > > >  		}
> > >
> > > The condition is always true. A page that has page->inuse == objects
> > > would not be on the partial list.
> > >
> >
> > This is in case we failed to allocate the slabs_by_inuse array. We only
> > have a list for empty slabs then (on stack).
> 
> Ok in that case objects == 1. If you want to do this maybe do it in a more
> general way?
> 
> You could allocate an array on the stack to deal with the common cases. I
> believe an array of 32 objects would be fine to allocate and cover most of
> the slab caches on the system? Would eliminate most of the allocations in
> kmem_cache_shrink.

We could do that, but IMO it would only complicate the code without
yielding any real benefit. This function is slow and called rarely
anyway, so I don't think there is any point in optimizing out a page
allocation here.

Thanks,
Vladimir


* Re: [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value
  2015-01-26 18:26       ` Christoph Lameter
@ 2015-01-26 19:48         ` Vladimir Davydov
  2015-01-26 19:55           ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-26 19:48 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, Jan 26, 2015 at 12:26:57PM -0600, Christoph Lameter wrote:
> On Mon, 26 Jan 2015, Vladimir Davydov wrote:
> 
> > __cache_shrink() is used not only in __kmem_cache_shrink(), but also in
> > SLAB's __kmem_cache_shutdown(), where we do need its return value to
> > check if the cache is empty.
> 
> It could be useful to know if a slab is empty. So maybe leave
> kmem_cache_shrink the way it is and instead fix up slub to return the
> proper value?

Hmm, why? The return value has existed since this function was
introduced, but nobody seems to have ever used it outside the slab core.
Besides, this check is racy, so IMO we shouldn't encourage users of the
API to rely on it. So I believe we should drop the return value for now.
If anybody ever needs it, we can reintroduce it.

Thanks,
Vladimir


* Re: [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails
  2015-01-26 19:36         ` Vladimir Davydov
@ 2015-01-26 19:53           ` Christoph Lameter
  2015-01-27 12:58             ` Vladimir Davydov
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2015-01-26 19:53 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, 26 Jan 2015, Vladimir Davydov wrote:

> We could do that, but IMO that would only complicate the code w/o
> yielding any real benefits. This function is slow and called rarely
> anyway, so I don't think there is any point to optimize out a page
> allocation here.

I think you already have the code there. Simply allow the sizing of the
empty_slabs[] array and rename it.



* Re: [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value
  2015-01-26 19:48         ` Vladimir Davydov
@ 2015-01-26 19:55           ` Christoph Lameter
  2015-01-26 20:16             ` Vladimir Davydov
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2015-01-26 19:55 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, 26 Jan 2015, Vladimir Davydov wrote:

> Hmm, why? The return value has existed since this function was
> introduced, but nobody seems to have ever used it outside the slab core.
> Besides, this check is racy, so IMO we shouldn't encourage users of the
> API to rely on it. That said, I believe we should drop the return value
> for now. If anybody ever needs it, we can reintroduce it.

The check is only racy if you have concurrent users. It is not racy if a
subsystem shuts down access to the slabs and then checks if everything is
clean before closing the cache.

Slab creation and destruction are not serialized. It is the responsibility
of the subsystem to make sure that there are no concurrent users and that
there are no objects remaining before destroying a slab.
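
A subsystem shutdown path could then look roughly like this (hypothetical
names, and it relies on the SLAB-style semantics where a zero return
means all slabs were released):

	/* Quiesce the subsystem first so no new objects can appear. */
	foo_stop_users();

	if (kmem_cache_shrink(foo_cachep))
		pr_err("foo: objects still in use, not destroying cache\n");
	else
		kmem_cache_destroy(foo_cachep);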



* Re: [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value
  2015-01-26 19:55           ` Christoph Lameter
@ 2015-01-26 20:16             ` Vladimir Davydov
  2015-01-26 20:28               ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-26 20:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, Jan 26, 2015 at 01:55:14PM -0600, Christoph Lameter wrote:
> On Mon, 26 Jan 2015, Vladimir Davydov wrote:
> 
> > Hmm, why? The return value has existed since this function was
> > introduced, but nobody seems to have ever used it outside the slab core.
> > Besides, this check is racy, so IMO we shouldn't encourage users of the
> > API to rely on it. That said, I believe we should drop the return value
> > for now. If anybody ever needs it, we can reintroduce it.
> 
> The check is only racy if you have concurrent users. It is not racy if a
> subsystem shuts down access to the slabs and then checks if everything is
> clean before closing the cache.
>
> Slab creation and destruction are not serialized. It is the responsibility
> of the subsystem to make sure that there are no concurrent users and that
> there are no objects remaining before destroying a slab.

Right, but I just don't see why a subsystem using a kmem_cache would
need to check whether there are any objects left in the cache. I mean,
it should somehow keep track of the objects it's allocated anyway, e.g.
by linking them in a list. That means it must already have a way to
check if it is safe to destroy its cache or not.

Suppose we leave the return value as is. A subsystem, right before going
to destroy a cache, calls kmem_cache_shrink, which returns 1 (slab is
not empty). What is it supposed to do then?

Thanks,
Vladimir


* Re: [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value
  2015-01-26 20:16             ` Vladimir Davydov
@ 2015-01-26 20:28               ` Christoph Lameter
  2015-01-26 20:43                 ` Vladimir Davydov
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2015-01-26 20:28 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, 26 Jan 2015, Vladimir Davydov wrote:

> Right, but I just don't see why a subsystem using a kmem_cache would
> need to check whether there are any objects left in the cache. I mean,
> it should somehow keep track of the objects it's allocated anyway, e.g.
> by linking them in a list. That means it must already have a way to
> check if it is safe to destroy its cache or not.

The acpi subsystem did that at some point.

> Suppose we leave the return value as is. A subsystem, right before going
> to destroy a cache, calls kmem_cache_shrink, which returns 1 (slab is
> not empty). What is it supposed to do then?

That is up to the subsystem. If it has a means of tracking down the
missing objects then it can deal with them. If not, then it cannot shut
down the cache and must take a proper recovery action instead.



* Re: [PATCH -mm 2/3] slab: zap kmem_cache_shrink return value
  2015-01-26 20:28               ` Christoph Lameter
@ 2015-01-26 20:43                 ` Vladimir Davydov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-26 20:43 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, Jan 26, 2015 at 02:28:33PM -0600, Christoph Lameter wrote:
> On Mon, 26 Jan 2015, Vladimir Davydov wrote:
> 
> > Right, but I just don't see why a subsystem using a kmem_cache would
> > need to check whether there are any objects left in the cache. I mean,
> > it should somehow keep track of the objects it's allocated anyway, e.g.
> > by linking them in a list. That means it must already have a way to
> > check if it is safe to destroy its cache or not.
> 
> The acpi subsystem did that at some point.
> 
> > Suppose we leave the return value as is. A subsystem, right before going
> > to destroy a cache, calls kmem_cache_shrink, which returns 1 (slab is
> > not empty). What is it supposed to do then?
> 
> That is up to the subsystem. If it has a means of tracking down the
> missing object then it can deal with it. If not then it cannot shutdown
> the cache and do a proper recovery action.

Hmm, we could make kmem_cache_destroy return EBUSY for the purpose.
However, since it spits warnings on failure, which is reasonable, we
have this check in kmem_cache_shrink...

Anyway, I see your point now, thank you for pointing it out. In the next
iteration I will fix SLUB's __kmem_cache_shrink retval instead of
removing it altogether.

Thanks,
Vladimir


* Re: [PATCH -mm 3/3] slub: make dead caches discard free slabs immediately
  2015-01-26 12:55 ` [PATCH -mm 3/3] slub: make dead caches discard free slabs immediately Vladimir Davydov
@ 2015-01-27  8:00   ` Joonsoo Kim
  2015-01-27  8:23     ` Vladimir Davydov
  0 siblings, 1 reply; 24+ messages in thread
From: Joonsoo Kim @ 2015-01-27  8:00 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, Jan 26, 2015 at 03:55:29PM +0300, Vladimir Davydov wrote:
> To speed up further allocations SLUB may store empty slabs in per
> cpu/node partial lists instead of freeing them immediately. This
> prevents per memcg caches destruction, because kmem caches created for a
> memory cgroup are only destroyed after the last page charged to the
> cgroup is freed.
> 
> To fix this issue, this patch resurrects approach first proposed in [1].
> It forbids SLUB to cache empty slabs after the memory cgroup that the
> cache belongs to was destroyed. It is achieved by setting kmem_cache's
> cpu_partial and min_partial constants to 0 and tuning put_cpu_partial()
> so that it would drop frozen empty slabs immediately if cpu_partial = 0.
> 
> The runtime overhead is minimal. From all the hot functions, we only
> touch relatively cold put_cpu_partial(): we make it call
> unfreeze_partials() after freezing a slab that belongs to an offline
> memory cgroup. Since slab freezing exists to avoid moving slabs from/to
> a partial list on free/alloc, and there can't be allocations from dead
> caches, it shouldn't cause any overhead. We do have to disable
> preemption for put_cpu_partial() to achieve that though.
> 
> The original patch was accepted well and even merged to the mm tree.
> However, I decided to withdraw it due to changes happening to the memcg
> core at that time. I had an idea of introducing per-memcg shrinkers for
> kmem caches, but now, as memcg has finally settled down, I do not see it
> as an option, because SLUB shrinker would be too costly to call since
> SLUB does not keep free slabs on a separate list. Besides, we currently
> do not even call per-memcg shrinkers for offline memcgs. Overall, it
> would introduce much more complexity to both SLUB and memcg than this
> small patch.
> 
> Regarding to SLAB, there's no problem with it, because it shrinks
> per-cpu/node caches periodically. Thanks to list_lru reparenting, we no
> longer keep entries for offline cgroups in per-memcg arrays (such as
> memcg_cache_params->memcg_caches), so we do not have to bother if a
> per-memcg cache will be shrunk a bit later than it could be.
> 
> [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
>  mm/slab.c        |    2 +-
>  mm/slab.h        |    2 +-
>  mm/slab_common.c |   15 +++++++++++++--
>  mm/slob.c        |    2 +-
>  mm/slub.c        |   25 ++++++++++++++++++++-----
>  5 files changed, 36 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index 279c44d6d8e1..f0514df07b85 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -2400,7 +2400,7 @@ static int __cache_shrink(struct kmem_cache *cachep)
>  	return (ret ? 1 : 0);
>  }
>  
> -void __kmem_cache_shrink(struct kmem_cache *cachep)
> +void __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
>  {
>  	__cache_shrink(cachep);
>  }
> diff --git a/mm/slab.h b/mm/slab.h
> index c036e520d2cf..041260197984 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -138,7 +138,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
>  #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS)
>  
>  int __kmem_cache_shutdown(struct kmem_cache *);
> -void __kmem_cache_shrink(struct kmem_cache *);
> +void __kmem_cache_shrink(struct kmem_cache *, bool);
>  void slab_kmem_cache_release(struct kmem_cache *);
>  
>  struct seq_file;
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 6803639fdff0..472ab7fcffd4 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -549,10 +549,13 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
>  {
>  	int idx;
>  	struct memcg_cache_array *arr;
> -	struct kmem_cache *s;
> +	struct kmem_cache *s, *c;
>  
>  	idx = memcg_cache_id(memcg);
>  
> +	get_online_cpus();
> +	get_online_mems();
> +
>  	mutex_lock(&slab_mutex);
>  	list_for_each_entry(s, &slab_caches, list) {
>  		if (!is_root_cache(s))
> @@ -560,9 +563,17 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
>  
>  		arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
>  						lockdep_is_held(&slab_mutex));
> +		c = arr->entries[idx];
> +		if (!c)
> +			continue;
> +
> +		__kmem_cache_shrink(c, true);
>  		arr->entries[idx] = NULL;
>  	}
>  	mutex_unlock(&slab_mutex);
> +
> +	put_online_mems();
> +	put_online_cpus();
>  }
>  
>  void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
> @@ -646,7 +657,7 @@ void kmem_cache_shrink(struct kmem_cache *cachep)
>  {
>  	get_online_cpus();
>  	get_online_mems();
> -	__kmem_cache_shrink(cachep);
> +	__kmem_cache_shrink(cachep, false);
>  	put_online_mems();
>  	put_online_cpus();
>  }
> diff --git a/mm/slob.c b/mm/slob.c
> index 043a14b6ccbe..e63ff9d926dc 100644
> --- a/mm/slob.c
> +++ b/mm/slob.c
> @@ -618,7 +618,7 @@ int __kmem_cache_shutdown(struct kmem_cache *c)
>  	return 0;
>  }
>  
> -void __kmem_cache_shrink(struct kmem_cache *c)
> +void __kmem_cache_shrink(struct kmem_cache *c, bool deactivate)
>  {
>  }
>  
> diff --git a/mm/slub.c b/mm/slub.c
> index c09d93dde40e..6f57824af019 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2007,6 +2007,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>  	int pages;
>  	int pobjects;
>  
> +	preempt_disable();
>  	do {
>  		pages = 0;
>  		pobjects = 0;
> @@ -2040,6 +2041,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>  
>  	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
>  								!= oldpage);
> +	if (unlikely(!s->cpu_partial)) {
> +		unsigned long flags;
> +
> +		local_irq_save(flags);
> +		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> +		local_irq_restore(flags);
> +	}
> +	preempt_enable();
>  #endif
>  }
>  
> @@ -3368,7 +3377,7 @@ EXPORT_SYMBOL(kfree);
>   * being allocated from last increasing the chance that the last objects
>   * are freed in them.
>   */
> -void __kmem_cache_shrink(struct kmem_cache *s)
> +void __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
>  {
>  	int node;
>  	int i;
> @@ -3381,6 +3390,15 @@ void __kmem_cache_shrink(struct kmem_cache *s)
>  		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
>  	unsigned long flags;
>  
> +	if (deactivate) {
> +		/*
> +		 * Disable empty slabs caching. Used to avoid pinning offline
> +		 * memory cgroups by freeable kmem pages.
> +		 */
> +		s->cpu_partial = 0;
> +		s->min_partial = 0;
> +	}
> +

Hello,

Maybe kick_all_cpus_sync() is needed here, since objects may be freed
asynchronously on other cpus, which would not see this updated value.
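
Something like this is what I have in mind - just a sketch on top of your
hunk:

	if (deactivate) {
		/*
		 * Disable empty slabs caching. Used to avoid pinning offline
		 * memory cgroups by freeable kmem pages.
		 */
		s->cpu_partial = 0;
		s->min_partial = 0;
		/*
		 * Make sure frees on other cpus observe the new limits
		 * before we go on; s->cpu_partial is read locklessly in
		 * put_cpu_partial().
		 */
		kick_all_cpus_sync();
	}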

Thanks.


* Re: [PATCH -mm 3/3] slub: make dead caches discard free slabs immediately
  2015-01-27  8:00   ` Joonsoo Kim
@ 2015-01-27  8:23     ` Vladimir Davydov
  2015-01-27  9:21       ` Joonsoo Kim
  0 siblings, 1 reply; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-27  8:23 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

Hi Joonsoo,

On Tue, Jan 27, 2015 at 05:00:09PM +0900, Joonsoo Kim wrote:
> On Mon, Jan 26, 2015 at 03:55:29PM +0300, Vladimir Davydov wrote:
> > @@ -3381,6 +3390,15 @@ void __kmem_cache_shrink(struct kmem_cache *s)
> >  		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
> >  	unsigned long flags;
> >  
> > +	if (deactivate) {
> > +		/*
> > +		 * Disable empty slabs caching. Used to avoid pinning offline
> > +		 * memory cgroups by freeable kmem pages.
> > +		 */
> > +		s->cpu_partial = 0;
> > +		s->min_partial = 0;
> > +	}
> > +
> 
> Maybe, kick_all_cpus_sync() is needed here since object would
> be freed asynchronously so they can't see this updated value.

I thought flush_all() should do the trick, no?

Thanks,
Vladimir


* Re: [PATCH -mm 3/3] slub: make dead caches discard free slabs immediately
  2015-01-27  8:23     ` Vladimir Davydov
@ 2015-01-27  9:21       ` Joonsoo Kim
  2015-01-27  9:28         ` Vladimir Davydov
  0 siblings, 1 reply; 24+ messages in thread
From: Joonsoo Kim @ 2015-01-27  9:21 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Joonsoo Kim, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Johannes Weiner, Michal Hocko,
	Linux Memory Management List, LKML

2015-01-27 17:23 GMT+09:00 Vladimir Davydov <vdavydov@parallels.com>:
> Hi Joonsoo,
>
> On Tue, Jan 27, 2015 at 05:00:09PM +0900, Joonsoo Kim wrote:
>> On Mon, Jan 26, 2015 at 03:55:29PM +0300, Vladimir Davydov wrote:
>> > @@ -3381,6 +3390,15 @@ void __kmem_cache_shrink(struct kmem_cache *s)
>> >             kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
>> >     unsigned long flags;
>> >
>> > +   if (deactivate) {
>> > +           /*
>> > +            * Disable empty slabs caching. Used to avoid pinning offline
>> > +            * memory cgroups by freeable kmem pages.
>> > +            */
>> > +           s->cpu_partial = 0;
>> > +           s->min_partial = 0;
>> > +   }
>> > +
>>
>> Maybe, kick_all_cpus_sync() is needed here since object would
>> be freed asynchronously so they can't see this updated value.
>
> I thought flush_all() should do the trick, no?

Unfortunately, it doesn't.

flush_all() does not send an IPI to all cpus. It only sends IPIs to cpus
where certain conditions are met, and freeing could occur on the other
ones.

Thanks.


* Re: [PATCH -mm 3/3] slub: make dead caches discard free slabs immediately
  2015-01-27  9:21       ` Joonsoo Kim
@ 2015-01-27  9:28         ` Vladimir Davydov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-27  9:28 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Johannes Weiner, Michal Hocko,
	Linux Memory Management List, LKML

On Tue, Jan 27, 2015 at 06:21:14PM +0900, Joonsoo Kim wrote:
> 2015-01-27 17:23 GMT+09:00 Vladimir Davydov <vdavydov@parallels.com>:
> > Hi Joonsoo,
> >
> > On Tue, Jan 27, 2015 at 05:00:09PM +0900, Joonsoo Kim wrote:
> >> On Mon, Jan 26, 2015 at 03:55:29PM +0300, Vladimir Davydov wrote:
> >> > @@ -3381,6 +3390,15 @@ void __kmem_cache_shrink(struct kmem_cache *s)
> >> >             kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
> >> >     unsigned long flags;
> >> >
> >> > +   if (deactivate) {
> >> > +           /*
> >> > +            * Disable empty slabs caching. Used to avoid pinning offline
> >> > +            * memory cgroups by freeable kmem pages.
> >> > +            */
> >> > +           s->cpu_partial = 0;
> >> > +           s->min_partial = 0;
> >> > +   }
> >> > +
> >>
> >> Maybe, kick_all_cpus_sync() is needed here since object would
> >> be freed asynchronously so they can't see this updated value.
> >
> > I thought flush_all() should do the trick, no?
> 
> Unfortunately, it doesn't.
> 
> flush_all() sends IPI to not all cpus. It only sends IPI to cpus where
> some conditions
> are met and freeing could occur on the other ones.

Oh, true, missed that. Yeah, we should kick all cpus explicitly then.
Will fix in the next iteration. Thanks for catching this!

Thanks,
Vladimir


* Re: [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails
  2015-01-26 19:53           ` Christoph Lameter
@ 2015-01-27 12:58             ` Vladimir Davydov
  2015-01-27 17:02               ` Christoph Lameter
  0 siblings, 1 reply; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-27 12:58 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Mon, Jan 26, 2015 at 01:53:32PM -0600, Christoph Lameter wrote:
> On Mon, 26 Jan 2015, Vladimir Davydov wrote:
> 
> > We could do that, but IMO that would only complicate the code w/o
> > yielding any real benefits. This function is slow and called rarely
> > anyway, so I don't think there is any point to optimize out a page
> > allocation here.
> 
> I think you already have the code there. Simply allow the sizeing of the
> empty_page[] array. And rename it.
> 

Maybe we could remove this allocation altogether then? I mean, always
distribute slabs among a constant number of buckets, say 32, like this:
diff --git a/mm/slub.c b/mm/slub.c
index 5ed1a73e2ec8..a43b213770b4 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3358,6 +3358,8 @@ void kfree(const void *x)
 }
 EXPORT_SYMBOL(kfree);
 
+#define SHRINK_BUCKETS 32
+
 /*
  * kmem_cache_shrink removes empty slabs from the partial lists and sorts
  * the remaining slabs by the number of items in use. The slabs with the
@@ -3376,19 +3378,15 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	struct page *page;
 	struct page *t;
 	int objects = oo_objects(s->max);
-	struct list_head *slabs_by_inuse =
-		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
+	struct list_head slabs_by_inuse[SHRINK_BUCKETS];
 	unsigned long flags;
 
-	if (!slabs_by_inuse)
-		return -ENOMEM;
-
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
 		if (!n->nr_partial)
 			continue;
 
-		for (i = 0; i < objects; i++)
+		for (i = 0; i < SHRINK_BUCKETS; i++)
 			INIT_LIST_HEAD(slabs_by_inuse + i);
 
 		spin_lock_irqsave(&n->list_lock, flags);
@@ -3400,7 +3398,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			list_move(&page->lru, slabs_by_inuse + page->inuse);
+			i = DIV_ROUND_UP(page->inuse * (SHRINK_BUCKETS - 1),
+					 objects);
+			list_move(&page->lru, slabs_by_inuse + i);
 			if (!page->inuse)
 				n->nr_partial--;
 		}
@@ -3409,7 +3409,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		 * Rebuild the partial list with the slabs filled up most
 		 * first and the least used slabs at the end.
 		 */
-		for (i = objects - 1; i > 0; i--)
+		for (i = SHRINK_BUCKETS - 1; i > 0; i--)
 			list_splice(slabs_by_inuse + i, n->partial.prev);
 
 		spin_unlock_irqrestore(&n->list_lock, flags);
@@ -3419,7 +3419,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 			discard_slab(s, page);
 	}
 
-	kfree(slabs_by_inuse);
 	return 0;
 }
 


* Re: [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails
  2015-01-27 12:58             ` Vladimir Davydov
@ 2015-01-27 17:02               ` Christoph Lameter
  2015-01-28 15:00                 ` Vladimir Davydov
  0 siblings, 1 reply; 24+ messages in thread
From: Christoph Lameter @ 2015-01-27 17:02 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Tue, 27 Jan 2015, Vladimir Davydov wrote:

> May be, we could remove this allocation at all then? I mean, always
> distribute slabs among constant number of buckets, say 32, like this:

The point of the sorting is to have the slab pages that only have a few
objects available at the beginning of the list. Allocations can then
easily reduce the size of the partial page list.

What you could do is simply put all slab pages with more than 32 objects
available at the end of the list.
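
Something like this, roughly - just a sketch, SHRINK_PROMOTE_MAX, the
'discard' list and the exact splicing are illustrative:

	#define SHRINK_PROMOTE_MAX 32

	/* Inside __kmem_cache_shrink(), per node, under n->list_lock. */
	struct list_head discard;
	struct list_head promote[SHRINK_PROMOTE_MAX];
	struct page *page, *t;
	int i;

	INIT_LIST_HEAD(&discard);
	for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
		INIT_LIST_HEAD(promote + i);

	list_for_each_entry_safe(page, t, &n->partial, lru) {
		int free = page->objects - page->inuse;

		if (free == page->objects) {
			/* Completely empty: discard it after unlocking. */
			list_move(&page->lru, &discard);
			n->nr_partial--;
		} else if (free <= SHRINK_PROMOTE_MAX) {
			/* Nearly full: promote towards the list head. */
			list_move(&page->lru, promote + free - 1);
		}
		/* Slabs with more free objects stay behind the promoted ones. */
	}

	/* Splice the fullest slabs to the head of the partial list. */
	for (i = SHRINK_PROMOTE_MAX - 1; i >= 0; i--)
		list_splice(promote + i, &n->partial);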


* Re: [PATCH -mm 1/3] slub: don't fail kmem_cache_shrink if slab placement optimization fails
  2015-01-27 17:02               ` Christoph Lameter
@ 2015-01-28 15:00                 ` Vladimir Davydov
  0 siblings, 0 replies; 24+ messages in thread
From: Vladimir Davydov @ 2015-01-28 15:00 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, Joonsoo Kim,
	Johannes Weiner, Michal Hocko, linux-mm, linux-kernel

On Tue, Jan 27, 2015 at 11:02:12AM -0600, Christoph Lameter wrote:
> What you could do is simply put all slab pages with more than 32 objects
> available at the end of the list.

OK, got it, will redo. Thanks!

