* [PATCH -mm 0/8] memcg/slab: reintroduce dead cache self-destruction
From: Vladimir Davydov @ 2014-05-30 13:51 UTC (permalink / raw)
  To: akpm; +Cc: cl, hannes, mhocko, linux-kernel, linux-mm

Hi,

When a memcg is turned offline, some of its kmem caches can still have
active objects and therefore cannot be destroyed immediately. Currently,
we simply leak such caches along with the owner memcg, which is bad and
should be resolved.

Ideally, we would move all slab pages of such dead caches to the
root/parent cache on memcg offline. However, when I tried to implement
such re-parenting, Christoph pointed out that its overhead would be
unacceptable, at least for SLUB (see
https://lkml.org/lkml/2014/5/13/446).

The problem with re-parenting individual slabs is that it requires
tracking all slabs allocated to a cache, but SLUB doesn't track full
slabs unless debugging is enabled. Changing this would significantly
degrade the regular alloc/free paths, because it would make them take
the per-node list locks more often.
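
For illustration, tracking full slabs would put something like the
following on the hot paths (a hypothetical helper, roughly what the
debug-only full-slab tracking in SLUB already does):

  /*
   * Hypothetical sketch: whenever a slab becomes full on the alloc
   * path (and the reverse on free), we would have to take the
   * per-node list_lock just to maintain a full-slab list.
   */
  static void note_slab_full(struct kmem_cache *s, struct page *page)
  {
          struct kmem_cache_node *n = get_node(s, page_to_nid(page));
          unsigned long flags;

          spin_lock_irqsave(&n->list_lock, flags);
          list_add(&page->lru, &n->full);
          spin_unlock_irqrestore(&n->list_lock, flags);
  }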

After pondering this problem for some time, I think we should return to
dead cache self-destruction, i.e. scheduling the cache destruction work
when the last slab page is freed.
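
In outline, this is what patches 1-3 below add (a sketch using the
field and function names the patches introduce, not existing API):

  /* memcg_cache_params gets a refcnt (one reference per slab, plus one
   * held for the memcg's lifetime) and a worker to destroy the cache */
  atomic_long_set(&s->memcg_params->refcnt, 1);
  INIT_WORK(&s->memcg_params->unregister_work, memcg_unregister_cache_func);

  /* slab allocated to the cache */
  atomic_long_inc(&cachep->memcg_params->refcnt);

  /* slab freed: if that was the last reference, destroy the cache */
  if (atomic_long_dec_and_test(&cachep->memcg_params->refcnt))
          schedule_work(&cachep->memcg_params->unregister_work);

  /* memcg offline: drop the reference held for the memcg's lifetime */
  if (atomic_long_dec_and_test(&cachep->memcg_params->refcnt))
          memcg_unregister_cache(cachep);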

This is the behavior we had before commit 5bd93da9917f ("memcg, slab:
simplify synchronization scheme"). It was removed because it simply
didn't work: SL[AU]B are implemented in such a way that they don't
discard empty slabs immediately, but prefer to keep them cached for an
indefinite time to speed up further allocations.

However, we can change this without noticeable performance impact for
both SLAB and SLUB. For SLAB, we can make the internal cache reaper
shrink dead caches aggressively, so that they die quickly after the
last object is freed. For SLUB, we can make the free path discard slabs
as soon as they become empty. Since dead caches should never be
allocated from, removing empty slabs from them shouldn't degrade
performance.
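
For the SLUB side, the gist of patches 3 and 7 below is roughly:

  /* mm/slab.h (patch 3): does this cache belong to an offline memcg? */
  static inline bool memcg_cache_dead(struct kmem_cache *s)
  {
          return !is_root_cache(s) && s->memcg_params->dead;
  }

  /* mm/slub.c (patch 7): when shrinking a dead cache, stop keeping
   * empty slabs on the per-node partial lists */
  if (memcg_cache_dead(s))
          s->min_partial = 0;

Patch 7 additionally disables the per cpu partial lists for dead
caches; the SLAB reaper change is patch 8.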

So, this patch set reintroduces dead cache self-destruction and adds
some tweaks to SL[AU]B to prevent dead caches from hanging around
indefinitely. It is organized as follows:

 - patches 1-3 reintroduce dead memcg cache self-destruction;
 - patches 4-5 do some cleanup in kmem_cache_shrink;
 - patch 6 is a minor optimization for SLUB, which makes the following
   work a bit easier for me;
 - patches 7 and 8 solve the problem with dead memcg caches for SLUB
   and SLAB respectively.

Even if the whole approach is NAK'ed, patches 4, 5, and 6 are worth
applying, IMO, provided Christoph doesn't mind, of course. They don't
depend on the rest of the set, BTW.

Note, this doesn't resolve the problem of memcgs pinned by dead kmem
caches. I'm planning to solve this by re-parenting dead kmem caches to
the parent memcg.

Thanks,

Vladimir Davydov (8):
  memcg: cleanup memcg_cache_params refcnt usage
  memcg: destroy kmem caches when last slab is freed
  memcg: mark caches that belong to offline memcgs as dead
  slub: never fail kmem_cache_shrink
  slab: remove kmem_cache_shrink retval
  slub: do not use cmpxchg for adding cpu partials when irqs disabled
  slub: make dead caches discard free slabs immediately
  slab: reap dead memcg caches aggressively

 include/linux/slab.h |   10 ++-
 mm/memcontrol.c      |   25 +++++-
 mm/slab.c            |   28 +++++--
 mm/slab.h            |   12 ++-
 mm/slab_common.c     |    8 +-
 mm/slob.c            |    3 +-
 mm/slub.c            |  212 ++++++++++++++++++++++++++++++++++----------------
 7 files changed, 207 insertions(+), 91 deletions(-)

-- 
1.7.10.4


* [PATCH -mm 1/8] memcg: cleanup memcg_cache_params refcnt usage
From: Vladimir Davydov @ 2014-05-30 13:51 UTC (permalink / raw)
  To: akpm; +Cc: cl, hannes, mhocko, linux-kernel, linux-mm

Currently, we count the number of pages allocated to a per-memcg cache
in memcg_cache_params->nr_pages. We only use this counter to find out
whether the cache is empty and can be destroyed. So let's rename it to
refcnt and make it count slabs rather than pages, so that we can use
atomic_inc/dec instead of atomic_add/sub in memcg_charge/uncharge_slab.

Also, since the number of slabs can theoretically exceed INT_MAX, let's
use atomic_long_t for the counter.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/slab.h |    4 ++--
 mm/memcontrol.c      |    6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1d9abb7d22a0..1985bd9bec7d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -526,7 +526,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * @memcg: pointer to the memcg this cache belongs to
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
- * @nr_pages: number of pages that belongs to this cache.
+ * @refcnt: reference counter
  */
 struct memcg_cache_params {
 	bool is_root_cache;
@@ -539,7 +539,7 @@ struct memcg_cache_params {
 			struct mem_cgroup *memcg;
 			struct list_head list;
 			struct kmem_cache *root_cache;
-			atomic_t nr_pages;
+			atomic_long_t refcnt;
 		};
 	};
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 15bda8133ff9..98a24e5ea4b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3279,7 +3279,7 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
 		cachep = memcg_params_to_cache(params);
 		kmem_cache_shrink(cachep);
-		if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
+		if (atomic_long_read(&cachep->memcg_params->refcnt) == 0)
 			memcg_unregister_cache(cachep);
 	}
 	mutex_unlock(&memcg_slab_mutex);
@@ -3353,14 +3353,14 @@ int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
 	res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp,
 				PAGE_SIZE << order);
 	if (!res)
-		atomic_add(1 << order, &cachep->memcg_params->nr_pages);
+		atomic_long_inc(&cachep->memcg_params->refcnt);
 	return res;
 }
 
 void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
 {
 	memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
-	atomic_sub(1 << order, &cachep->memcg_params->nr_pages);
+	atomic_long_dec(&cachep->memcg_params->refcnt);
 }
 
 /*
-- 
1.7.10.4


* [PATCH -mm 2/8] memcg: destroy kmem caches when last slab is freed
From: Vladimir Davydov @ 2014-05-30 13:51 UTC (permalink / raw)
  To: akpm; +Cc: cl, hannes, mhocko, linux-kernel, linux-mm

When memcg_cache_params->refcnt drops to zero, schedule the worker that
will unregister the cache. To prevent this from happening while the
owner memcg is still alive, keep the refcnt elevated for the memcg's
lifetime.

Note, this doesn't guarantee that a cache belonging to a dead memcg
will go away as soon as its last object is freed, because the SL[AU]B
implementations can cache empty slabs for performance reasons. Hence
the cache may hang around indefinitely after the memcg goes offline.
This is resolved by the following patches.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/slab.h |    2 ++
 mm/memcontrol.c      |   22 ++++++++++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1985bd9bec7d..d9716fdc8211 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -527,6 +527,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
  * @refcnt: reference counter
+ * @unregister_work: worker to destroy the cache
  */
 struct memcg_cache_params {
 	bool is_root_cache;
@@ -540,6 +541,7 @@ struct memcg_cache_params {
 			struct list_head list;
 			struct kmem_cache *root_cache;
 			atomic_long_t refcnt;
+			struct work_struct unregister_work;
 		};
 	};
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 98a24e5ea4b5..886b5b414958 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3114,6 +3114,8 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
 	return 0;
 }
 
+static void memcg_unregister_cache_func(struct work_struct *work);
+
 int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 			     struct kmem_cache *root_cache)
 {
@@ -3135,6 +3137,9 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 	if (memcg) {
 		s->memcg_params->memcg = memcg;
 		s->memcg_params->root_cache = root_cache;
+		atomic_long_set(&s->memcg_params->refcnt, 1);
+		INIT_WORK(&s->memcg_params->unregister_work,
+			  memcg_unregister_cache_func);
 		css_get(&memcg->css);
 	} else
 		s->memcg_params->is_root_cache = true;
@@ -3216,6 +3221,17 @@ static void memcg_unregister_cache(struct kmem_cache *cachep)
 	kmem_cache_destroy(cachep);
 }
 
+static void memcg_unregister_cache_func(struct work_struct *work)
+{
+	struct memcg_cache_params *params =
+		container_of(work, struct memcg_cache_params, unregister_work);
+	struct kmem_cache *cachep = memcg_params_to_cache(params);
+
+	mutex_lock(&memcg_slab_mutex);
+	memcg_unregister_cache(cachep);
+	mutex_unlock(&memcg_slab_mutex);
+}
+
 /*
  * During the creation a new cache, we need to disable our accounting mechanism
  * altogether. This is true even if we are not creating, but rather just
@@ -3279,7 +3295,7 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
 		cachep = memcg_params_to_cache(params);
 		kmem_cache_shrink(cachep);
-		if (atomic_long_read(&cachep->memcg_params->refcnt) == 0)
+		if (atomic_long_dec_and_test(&cachep->memcg_params->refcnt))
 			memcg_unregister_cache(cachep);
 	}
 	mutex_unlock(&memcg_slab_mutex);
@@ -3360,7 +3376,9 @@ int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
 void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
 {
 	memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
-	atomic_long_dec(&cachep->memcg_params->refcnt);
+
+	if (unlikely(atomic_long_dec_and_test(&cachep->memcg_params->refcnt)))
+		schedule_work(&cachep->memcg_params->unregister_work);
 }
 
 /*
-- 
1.7.10.4


* [PATCH -mm 3/8] memcg: mark caches that belong to offline memcgs as dead
From: Vladimir Davydov @ 2014-05-30 13:51 UTC (permalink / raw)
  To: akpm; +Cc: cl, hannes, mhocko, linux-kernel, linux-mm

Add a 'dead' flag to memcg_cache_params along with a memcg_cache_dead()
helper, and set the flag when the owner memcg is turned offline. This
will be used by the following patches.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/slab.h |    2 ++
 mm/memcontrol.c      |    1 +
 mm/slab.h            |   10 ++++++++++
 3 files changed, 13 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index d9716fdc8211..d99d5212b815 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -527,6 +527,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
  * @refcnt: reference counter
+ * @dead: set to true when owner memcg is turned offline
  * @unregister_work: worker to destroy the cache
  */
 struct memcg_cache_params {
@@ -541,6 +542,7 @@ struct memcg_cache_params {
 			struct list_head list;
 			struct kmem_cache *root_cache;
 			atomic_long_t refcnt;
+			bool dead;
 			struct work_struct unregister_work;
 		};
 	};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 886b5b414958..ed42fd1105a5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3294,6 +3294,7 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 	mutex_lock(&memcg_slab_mutex);
 	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
 		cachep = memcg_params_to_cache(params);
+		cachep->memcg_params->dead = true;
 		kmem_cache_shrink(cachep);
 		if (atomic_long_dec_and_test(&cachep->memcg_params->refcnt))
 			memcg_unregister_cache(cachep);
diff --git a/mm/slab.h b/mm/slab.h
index 961a3fb1f5a2..9515cc520bf8 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -121,6 +121,11 @@ static inline bool is_root_cache(struct kmem_cache *s)
 	return !s->memcg_params || s->memcg_params->is_root_cache;
 }
 
+static inline bool memcg_cache_dead(struct kmem_cache *s)
+{
+	return !is_root_cache(s) && s->memcg_params->dead;
+}
+
 static inline bool slab_equal_or_root(struct kmem_cache *s,
 					struct kmem_cache *p)
 {
@@ -203,6 +208,11 @@ static inline bool is_root_cache(struct kmem_cache *s)
 	return true;
 }
 
+static inline bool memcg_cache_dead(struct kmem_cache *s)
+{
+	return false;
+}
+
 static inline bool slab_equal_or_root(struct kmem_cache *s,
 				      struct kmem_cache *p)
 {
-- 
1.7.10.4


* [PATCH -mm 4/8] slub: never fail kmem_cache_shrink
From: Vladimir Davydov @ 2014-05-30 13:51 UTC (permalink / raw)
  To: akpm; +Cc: cl, hannes, mhocko, linux-kernel, linux-mm

SLUB's kmem_cache_shrink not only removes empty slabs from the cache,
but also sorts slabs by the number of objects in use to cope with
fragmentation. To achieve that, it tries to allocate a temporary array.
If the allocation fails, it aborts the whole procedure.

This is unacceptable for kmemcg, where we must be sure that all empty
slabs are removed from the cache on memcg offline. So if the allocation
fails, let's skip the de-fragmentation step, but still get rid of the
empty slabs.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slub.c |   39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index d96faa2464c3..d9976ea93710 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3404,12 +3404,16 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	struct page *page;
 	struct page *t;
 	int objects = oo_objects(s->max);
-	struct list_head *slabs_by_inuse =
-		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
+	LIST_HEAD(empty_slabs);
+	struct list_head *slabs_by_inuse;
 	unsigned long flags;
 
-	if (!slabs_by_inuse)
-		return -ENOMEM;
+	slabs_by_inuse = kcalloc(objects - 1, sizeof(struct list_head),
+				 GFP_KERNEL);
+	if (slabs_by_inuse) {
+		for (i = 0; i < objects - 1; i++)
+			INIT_LIST_HEAD(slabs_by_inuse + i);
+	}
 
 	flush_all(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
@@ -3418,9 +3422,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		if (!n->nr_partial)
 			continue;
 
-		for (i = 0; i < objects; i++)
-			INIT_LIST_HEAD(slabs_by_inuse + i);
-
 		spin_lock_irqsave(&n->list_lock, flags);
 
 		/*
@@ -3430,22 +3431,28 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			list_move(&page->lru, slabs_by_inuse + page->inuse);
-			if (!page->inuse)
+			if (!page->inuse) {
+				list_move(&page->lru, &empty_slabs);
 				n->nr_partial--;
+			} else if (slabs_by_inuse)
+				list_move(&page->lru,
+					  slabs_by_inuse + page->inuse - 1);
 		}
 
 		/*
 		 * Rebuild the partial list with the slabs filled up most
 		 * first and the least used slabs at the end.
 		 */
-		for (i = objects - 1; i > 0; i--)
-			list_splice(slabs_by_inuse + i, n->partial.prev);
+		if (slabs_by_inuse) {
+			for (i = objects - 2; i >= 0; i--)
+				list_splice_tail_init(slabs_by_inuse + i,
+						      &n->partial);
+		}
 
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
 		/* Release empty slabs */
-		list_for_each_entry_safe(page, t, slabs_by_inuse, lru)
+		list_for_each_entry_safe(page, t, &empty_slabs, lru)
 			discard_slab(s, page);
 	}
 
@@ -4780,13 +4787,9 @@ static ssize_t shrink_show(struct kmem_cache *s, char *buf)
 static ssize_t shrink_store(struct kmem_cache *s,
 			const char *buf, size_t length)
 {
-	if (buf[0] == '1') {
-		int rc = kmem_cache_shrink(s);
-
-		if (rc)
-			return rc;
-	} else
+	if (buf[0] != '1')
 		return -EINVAL;
+	kmem_cache_shrink(s);
 	return length;
 }
 SLAB_ATTR(shrink);
-- 
1.7.10.4


* [PATCH -mm 5/8] slab: remove kmem_cache_shrink retval
From: Vladimir Davydov @ 2014-05-30 13:51 UTC (permalink / raw)
  To: akpm; +Cc: cl, hannes, mhocko, linux-kernel, linux-mm

First, nobody uses it. Second, it differs across implementations: SLUB
always returns 0, while SLAB returns 0 only if the cache appears to be
empty. So let's get rid of it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/slab.h |    2 +-
 mm/slab.c            |   11 ++++++++---
 mm/slab.h            |    2 +-
 mm/slab_common.c     |    8 ++------
 mm/slob.c            |    3 +--
 mm/slub.c            |    3 +--
 6 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index d99d5212b815..d88ae36e2a15 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -121,7 +121,7 @@ struct kmem_cache *memcg_create_kmem_cache(struct mem_cgroup *,
 					   const char *);
 #endif
 void kmem_cache_destroy(struct kmem_cache *);
-int kmem_cache_shrink(struct kmem_cache *);
+void kmem_cache_shrink(struct kmem_cache *);
 void kmem_cache_free(struct kmem_cache *, void *);
 
 /*
diff --git a/mm/slab.c b/mm/slab.c
index 9ca3b87edabc..cecc01bba389 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2478,7 +2478,7 @@ out:
 	return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep)
+static int __cache_shrink(struct kmem_cache *cachep)
 {
 	int ret = 0, i = 0;
 	struct kmem_cache_node *n;
@@ -2499,12 +2499,17 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 	return (ret ? 1 : 0);
 }
 
+void __kmem_cache_shrink(struct kmem_cache *cachep)
+{
+	__cache_shrink(cachep);
+}
+
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-	int i;
+	int i, rc;
 	struct kmem_cache_node *n;
-	int rc = __kmem_cache_shrink(cachep);
 
+	rc = __cache_shrink(cachep);
 	if (rc)
 		return rc;
 
diff --git a/mm/slab.h b/mm/slab.h
index 9515cc520bf8..c03d707033b7 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -91,7 +91,7 @@ __kmem_cache_alias(const char *name, size_t size, size_t align,
 #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS)
 
 int __kmem_cache_shutdown(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *);
+void __kmem_cache_shrink(struct kmem_cache *);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 735e01a0db6f..015fa1d854a9 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -380,18 +380,14 @@ EXPORT_SYMBOL(kmem_cache_destroy);
  * @cachep: The cache to shrink.
  *
  * Releases as many slabs as possible for a cache.
- * To help debugging, a zero exit status indicates all slabs were released.
  */
-int kmem_cache_shrink(struct kmem_cache *cachep)
+void kmem_cache_shrink(struct kmem_cache *cachep)
 {
-	int ret;
-
 	get_online_cpus();
 	get_online_mems();
-	ret = __kmem_cache_shrink(cachep);
+	__kmem_cache_shrink(cachep);
 	put_online_mems();
 	put_online_cpus();
-	return ret;
 }
 EXPORT_SYMBOL(kmem_cache_shrink);
 
diff --git a/mm/slob.c b/mm/slob.c
index 21980e0f39a8..bb6471f26640 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -620,9 +620,8 @@ int __kmem_cache_shutdown(struct kmem_cache *c)
 	return 0;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d)
+void __kmem_cache_shrink(struct kmem_cache *d)
 {
-	return 0;
 }
 
 struct kmem_cache kmem_cache_boot = {
diff --git a/mm/slub.c b/mm/slub.c
index d9976ea93710..2fc84853bffb 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3396,7 +3396,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+void __kmem_cache_shrink(struct kmem_cache *s)
 {
 	int node;
 	int i;
@@ -3457,7 +3457,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	}
 
 	kfree(slabs_by_inuse);
-	return 0;
 }
 
 static int slab_mem_going_offline_callback(void *arg)
-- 
1.7.10.4


* [PATCH -mm 6/8] slub: do not use cmpxchg for adding cpu partials when irqs disabled
From: Vladimir Davydov @ 2014-05-30 13:51 UTC (permalink / raw)
  To: akpm; +Cc: cl, hannes, mhocko, linux-kernel, linux-mm

We add slabs to per cpu partial lists on both object allocation (see
get_partial_node) and object free (see __slab_free). We use the same
function, put_cpu_partial, in both cases.

Since __slab_free can be executed with preemption/irqs enabled, we have
to use cmpxchg for adding a new element to a partial list in order to
avoid races in case we are migrated to another cpu or an irq hits while
we are in the middle of put_cpu_partial.

However, get_partial_node is always called with irqs disabled, which
grants us exclusive access to the current cpu's partial list, so there
is no need for any synchronization there, and cmpxchg is therefore
redundant.

Let's get rid of this redundancy and access/set the per cpu partial
list from get_partial_node without cmpxchg-based synchronization.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slub.c |   46 +++++++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 19 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 2fc84853bffb..ac39cc9b6849 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1603,7 +1603,7 @@ static inline void *acquire_slab(struct kmem_cache *s,
 	return freelist;
 }
 
-static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
+static void prepare_cpu_partial(struct page *page, struct page *oldpage);
 static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
 
 /*
@@ -1643,7 +1643,8 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
 		} else {
-			put_cpu_partial(s, page, 0);
+			prepare_cpu_partial(page, c->partial);
+			c->partial = page;
 			stat(s, CPU_PARTIAL_NODE);
 		}
 		if (!kmem_cache_has_cpu_partial(s)
@@ -2015,6 +2016,26 @@ static void unfreeze_partials(struct kmem_cache *s,
 #endif
 }
 
+static void prepare_cpu_partial(struct page *page, struct page *oldpage)
+{
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+	int pages = 0;
+	int pobjects = 0;
+
+	if (oldpage) {
+		pages = oldpage->pages;
+		pobjects = oldpage->pobjects;
+	}
+
+	pages++;
+	pobjects += page->objects - page->inuse;
+
+	page->pages = pages;
+	page->pobjects = pobjects;
+	page->next = oldpage;
+#endif
+}
+
 /*
  * Put a page that was just frozen (in __slab_free) into a partial page
  * slot if available. This is done without interrupts disabled and without
@@ -2024,22 +2045,16 @@ static void unfreeze_partials(struct kmem_cache *s,
  * If we did not find a slot then simply move all the partials to the
  * per node partial list.
  */
-static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
+static void put_cpu_partial(struct kmem_cache *s, struct page *page)
 {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct page *oldpage;
-	int pages;
-	int pobjects;
 
 	do {
-		pages = 0;
-		pobjects = 0;
 		oldpage = this_cpu_read(s->cpu_slab->partial);
 
 		if (oldpage) {
-			pobjects = oldpage->pobjects;
-			pages = oldpage->pages;
-			if (drain && pobjects > s->cpu_partial) {
+			if (oldpage->pobjects > s->cpu_partial) {
 				unsigned long flags;
 				/*
 				 * partial array is full. Move the existing
@@ -2049,18 +2064,11 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 				unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
 				local_irq_restore(flags);
 				oldpage = NULL;
-				pobjects = 0;
-				pages = 0;
 				stat(s, CPU_PARTIAL_DRAIN);
 			}
 		}
 
-		pages++;
-		pobjects += page->objects - page->inuse;
-
-		page->pages = pages;
-		page->pobjects = pobjects;
-		page->next = oldpage;
+		prepare_cpu_partial(page, oldpage);
 
 	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
 								!= oldpage);
@@ -2608,7 +2616,7 @@ static void __slab_free(struct kmem_cache *s, struct page *page,
 		 * per cpu partial list.
 		 */
 		if (new.frozen && !was_frozen) {
-			put_cpu_partial(s, page, 1);
+			put_cpu_partial(s, page);
 			stat(s, CPU_PARTIAL_FREE);
 		}
 		/*
-- 
1.7.10.4


* [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately
From: Vladimir Davydov @ 2014-05-30 13:51 UTC (permalink / raw)
  To: akpm; +Cc: cl, hannes, mhocko, linux-kernel, linux-mm

To speed up further allocations, SLUB may keep some empty slabs on per
cpu/node partial lists. If the cache is dead, i.e. belongs to a memcg
that was turned offline, there is no need for that, because dead caches
are never allocated from.

What is worse, keeping empty slabs on the list will prevent the dead
cache from self-destructing, because each allocated slab holds a
reference to the cache, and the cache can only be destroyed when its
reference counter reaches zero.

To make a SLUB cache discard slabs as soon as they become empty, we
have to (1) drain per cpu/node caches, (2) disable caching of empty
slabs on the per-node lists, and (3) disable per cpu partial lists
completely, because a slab that becomes empty on such a list is only
freed when unfreeze_partials() is called, which may never happen.

(1) is already done by kmem_cache_shrink(), which is called from
memcg_unregister_all_caches().

(2) is easy. It's enough to set kmem_cache->min_partial to 0 before
shrinking the cache. Since min_partial is only accessed under
kmem_cache_node->list_lock, and we take that lock while shrinking the
cache, this guarantees that no empty slabs will be added to the list
after kmem_cache_shrink is called.

(3) is a bit more difficult, because slabs are added to per-cpu partial
lists locklessly. Fortunately, we only have to handle the __slab_free
case: since no allocation requests should be dispatched to a dead memcg
cache, get_partial_node() should never be called. In __slab_free we use
cmpxchg to modify kmem_cache_cpu->partial (see put_cpu_partial), so
setting ->partial to a special value that makes put_cpu_partial bail
out does the trick.
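
In short, with the names this patch introduces:

  /* flush path, after the memcg went offline: disable per cpu partials */
  if (memcg_cache_dead(s))
          c->partial = CPU_SLAB_PARTIAL_DEAD;

  /* put_cpu_partial: partials are disabled, so unfreeze the slab and
   * put it back on its node's partial list (or discard it if empty) */
  if (oldpage == CPU_SLAB_PARTIAL_DEAD) {
          cancel_frozen(s, page);
          break;
  }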

Note, this shouldn't affect performance, because keeping empty slabs on
per node lists as well as using per cpu partials are only worthwhile if
the cache is used for allocations, which isn't the case for dead caches.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slub.c |  124 ++++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 95 insertions(+), 29 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index ac39cc9b6849..d5a54b03d558 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -133,6 +133,21 @@ static inline bool kmem_cache_has_cpu_partial(struct kmem_cache *s)
 }
 
 /*
+ * When the owner memcg of a cache is turned offline, its per cpu partial lists
+ * must be disabled, because they can contain empty slabs and hence prevent the
+ * cache from self-destruction.
+ *
+ * To zap per cpu partials, we set each cpu's cpu_slab->partial to a special
+ * value, CPU_SLAB_PARTIAL_DEAD. Whenever we see this value while trying to put
+ * a frozen slab to a per cpu partial list, we unfreeze the slab and put it
+ * back to its node's list.
+ *
+ * Actually, we only need to handle this on free path, because no allocations
+ * requests can be dispatched to a dead memcg cache.
+ */
+#define CPU_SLAB_PARTIAL_DEAD		((struct page *)~0UL)
+
+/*
  * Issues still to be resolved:
  *
  * - Support PAGE_ALLOC_DEBUG. Should be easy to do.
@@ -1643,6 +1658,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
 		} else {
+			VM_BUG_ON(c->partial == CPU_SLAB_PARTIAL_DEAD);
 			prepare_cpu_partial(page, c->partial);
 			c->partial = page;
 			stat(s, CPU_PARTIAL_NODE);
@@ -1948,6 +1964,56 @@ redo:
 	}
 }
 
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+static void __unfreeze_partial(struct kmem_cache *s, struct kmem_cache_node *n,
+			       struct page *page, struct page **discard_page)
+{
+	struct page new, old;
+
+	do {
+
+		old.freelist = page->freelist;
+		old.counters = page->counters;
+		VM_BUG_ON(!old.frozen);
+
+		new.counters = old.counters;
+		new.freelist = old.freelist;
+
+		new.frozen = 0;
+
+	} while (!__cmpxchg_double_slab(s, page,
+			old.freelist, old.counters,
+			new.freelist, new.counters,
+			"unfreezing slab"));
+
+	if (unlikely(!new.inuse && n->nr_partial > s->min_partial)) {
+		page->next = *discard_page;
+		*discard_page = page;
+	} else {
+		add_partial(n, page, DEACTIVATE_TO_TAIL);
+		stat(s, FREE_ADD_PARTIAL);
+	}
+}
+
+static void cancel_frozen(struct kmem_cache *s, struct page *page)
+{
+	struct page *discard_page = NULL;
+	struct kmem_cache_node *n;
+	unsigned long flags;
+
+	n = get_node(s, page_to_nid(page));
+	spin_lock_irqsave(&n->list_lock, flags);
+	__unfreeze_partial(s, n, page, &discard_page);
+	spin_unlock_irqrestore(&n->list_lock, flags);
+
+	if (discard_page) {
+		stat(s, DEACTIVATE_EMPTY);
+		discard_slab(s, discard_page);
+		stat(s, FREE_SLAB);
+	}
+}
+#endif
+
 /*
  * Unfreeze all the cpu partial slabs.
  *
@@ -1962,10 +2028,10 @@ static void unfreeze_partials(struct kmem_cache *s,
 	struct kmem_cache_node *n = NULL, *n2 = NULL;
 	struct page *page, *discard_page = NULL;
 
-	while ((page = c->partial)) {
-		struct page new;
-		struct page old;
+	if (c->partial == CPU_SLAB_PARTIAL_DEAD)
+		return;
 
+	while ((page = c->partial)) {
 		c->partial = page->next;
 
 		n2 = get_node(s, page_to_nid(page));
@@ -1977,29 +2043,7 @@ static void unfreeze_partials(struct kmem_cache *s,
 			spin_lock(&n->list_lock);
 		}
 
-		do {
-
-			old.freelist = page->freelist;
-			old.counters = page->counters;
-			VM_BUG_ON(!old.frozen);
-
-			new.counters = old.counters;
-			new.freelist = old.freelist;
-
-			new.frozen = 0;
-
-		} while (!__cmpxchg_double_slab(s, page,
-				old.freelist, old.counters,
-				new.freelist, new.counters,
-				"unfreezing slab"));
-
-		if (unlikely(!new.inuse && n->nr_partial > s->min_partial)) {
-			page->next = discard_page;
-			discard_page = page;
-		} else {
-			add_partial(n, page, DEACTIVATE_TO_TAIL);
-			stat(s, FREE_ADD_PARTIAL);
-		}
+		__unfreeze_partial(s, n, page, &discard_page);
 	}
 
 	if (n)
@@ -2053,6 +2097,15 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page)
 	do {
 		oldpage = this_cpu_read(s->cpu_slab->partial);
 
+		/*
+		 * Per cpu partials are disabled. Unfreeze the slab and put it
+		 * to the node partial list.
+		 */
+		if (oldpage == CPU_SLAB_PARTIAL_DEAD) {
+			cancel_frozen(s, page);
+			break;
+		}
+
 		if (oldpage) {
 			if (oldpage->pobjects > s->cpu_partial) {
 				unsigned long flags;
@@ -2099,6 +2152,13 @@ static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 			flush_slab(s, c);
 
 		unfreeze_partials(s, c);
+
+		/*
+		 * Disable per cpu partials if the cache is dead so that it
+		 * won't be pinned by empty slabs that can be cached there.
+		 */
+		if (memcg_cache_dead(s))
+			c->partial = CPU_SLAB_PARTIAL_DEAD;
 	}
 }
 
@@ -2368,6 +2428,7 @@ new_slab:
 
 	if (c->partial) {
 		page = c->page = c->partial;
+		VM_BUG_ON(page == CPU_SLAB_PARTIAL_DEAD);
 		c->partial = page->next;
 		stat(s, CPU_PARTIAL_ALLOC);
 		c->freelist = NULL;
@@ -3416,6 +3477,10 @@ void __kmem_cache_shrink(struct kmem_cache *s)
 	struct list_head *slabs_by_inuse;
 	unsigned long flags;
 
+	/* Make dead caches discard empty slabs immediately. */
+	if (memcg_cache_dead(s))
+		s->min_partial = 0;
+
 	slabs_by_inuse = kcalloc(objects - 1, sizeof(struct list_head),
 				 GFP_KERNEL);
 	if (slabs_by_inuse) {
@@ -4329,7 +4394,7 @@ static ssize_t show_slab_objects(struct kmem_cache *s,
 			nodes[node] += x;
 
 			page = ACCESS_ONCE(c->partial);
-			if (page) {
+			if (page && page != CPU_SLAB_PARTIAL_DEAD) {
 				node = page_to_nid(page);
 				if (flags & SO_TOTAL)
 					WARN_ON_ONCE(1);
@@ -4561,7 +4626,7 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
 	for_each_online_cpu(cpu) {
 		struct page *page = per_cpu_ptr(s->cpu_slab, cpu)->partial;
 
-		if (page) {
+		if (page && page != CPU_SLAB_PARTIAL_DEAD) {
 			pages += page->pages;
 			objects += page->pobjects;
 		}
@@ -4573,7 +4638,8 @@ static ssize_t slabs_cpu_partial_show(struct kmem_cache *s, char *buf)
 	for_each_online_cpu(cpu) {
 		struct page *page = per_cpu_ptr(s->cpu_slab, cpu) ->partial;
 
-		if (page && len < PAGE_SIZE - 20)
+		if (page && page != CPU_SLAB_PARTIAL_DEAD &&
+		    len < PAGE_SIZE - 20)
 			len += sprintf(buf + len, " C%d=%d(%d)", cpu,
 				page->pobjects, page->pages);
 	}
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH -mm 8/8] slab: reap dead memcg caches aggressively
  2014-05-30 13:51 ` Vladimir Davydov
@ 2014-05-30 13:51   ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-05-30 13:51 UTC (permalink / raw)
  To: akpm; +Cc: cl, hannes, mhocko, linux-kernel, linux-mm

There is no use in keeping free objects/slabs on dead memcg caches,
because they will never be used for allocations again. So let's make
cache_reap() shrink as many free objects from such caches as possible.

Note the difference between SLAB and SLUB handling of dead memcg caches.
For SLUB, dead cache destruction is scheduled as soon as the last object
is freed, because dead caches do not cache free objects. For SLAB, dead
caches can keep some free objects on per cpu arrays, so an empty dead
cache may hang around until cache_reap() drains it.

We don't disable free object caching for SLAB, because that would force
kfree to always take a spin lock, which would degrade performance
significantly.

Since cache_reap() drains all caches once every ~4 seconds on each CPU,
empty dead caches will die quickly.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slab.c |   17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index cecc01bba389..d81e46316c99 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3985,6 +3985,11 @@ static void cache_reap(struct work_struct *w)
 		goto out;
 
 	list_for_each_entry(searchp, &slab_caches, list) {
+		int force = 0;
+
+		if (memcg_cache_dead(searchp))
+			force = 1;
+
 		check_irq_on();
 
 		/*
@@ -3996,7 +4001,7 @@ static void cache_reap(struct work_struct *w)
 
 		reap_alien(searchp, n);
 
-		drain_array(searchp, n, cpu_cache_get(searchp), 0, node);
+		drain_array(searchp, n, cpu_cache_get(searchp), force, node);
 
 		/*
 		 * These are racy checks but it does not matter
@@ -4007,15 +4012,17 @@ static void cache_reap(struct work_struct *w)
 
 		n->next_reap = jiffies + REAPTIMEOUT_NODE;
 
-		drain_array(searchp, n, n->shared, 0, node);
+		drain_array(searchp, n, n->shared, force, node);
 
 		if (n->free_touched)
 			n->free_touched = 0;
 		else {
-			int freed;
+			int freed, tofree;
+
+			tofree = force ? slabs_tofree(searchp, n) :
+				DIV_ROUND_UP(n->free_limit, 5 * searchp->num);
 
-			freed = drain_freelist(searchp, n, (n->free_limit +
-				5 * searchp->num - 1) / (5 * searchp->num));
+			freed = drain_freelist(searchp, n, tofree);
 			STATS_ADD_REAPED(searchp, freed);
 		}
 next:
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 1/8] memcg: cleanup memcg_cache_params refcnt usage
  2014-05-30 13:51   ` Vladimir Davydov
@ 2014-05-30 14:31     ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-05-30 14:31 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, 30 May 2014, Vladimir Davydov wrote:

> Currently, we count the number of pages allocated to a per memcg cache
> in memcg_cache_params->nr_pages. We only use this counter to find out if
> the cache is empty and can be destroyed. So let's rename it to refcnt
> and make it count not pages, but slabs so that we can use atomic_inc/dec
> instead of atomic_add/sub in memcg_charge/uncharge_slab.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 2/8] memcg: destroy kmem caches when last slab is freed
  2014-05-30 13:51   ` Vladimir Davydov
@ 2014-05-30 14:32     ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-05-30 14:32 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, 30 May 2014, Vladimir Davydov wrote:

> When the memcg_cache_params->refcnt goes to 0, schedule the worker that
> will unregister the cache. To prevent this from happening when the owner
> memcg is alive, keep the refcnt incremented during memcg lifetime.

Acked-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 3/8] memcg: mark caches that belong to offline memcgs as dead
  2014-05-30 13:51   ` Vladimir Davydov
@ 2014-05-30 14:33     ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-05-30 14:33 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, 30 May 2014, Vladimir Davydov wrote:

> This will be used by the next patches.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 4/8] slub: never fail kmem_cache_shrink
  2014-05-30 13:51   ` Vladimir Davydov
@ 2014-05-30 14:46     ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-05-30 14:46 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, 30 May 2014, Vladimir Davydov wrote:

> SLUB's kmem_cache_shrink not only removes empty slabs from the cache,
> but also sorts slabs by the number of objects in-use to cope with
> fragmentation. To achieve that, it tries to allocate a temporary array.
> If it fails, it will abort the whole procedure.

If we cannot allocate a kernel structure that is mostly less than a page
in size, then we have much more important things to worry about.

The maximum number of objects per slab is 512 on my system here.

> This is unacceptable for kmemcg, where we want to be sure that all empty
> slabs are removed from the cache on memcg offline, so let's just skip
> the de-fragmentation step if the allocation fails, but still get rid of
> empty slabs.

Let's just try the shrink and log the fact that it failed? Try again later?
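
For instance, something like this (purely a sketch of the idea; the
delayed work item used for the retry is a hypothetical name, not
anything in the series):

	/* Sketch: log the shrink failure and retry the destruction later. */
	if (kmem_cache_shrink(s)) {
		pr_warn("slab %s: shrink failed, will retry destruction\n",
			s->name);
		/* destroy_work is a hypothetical delayed work item */
		schedule_delayed_work(&destroy_work, HZ);
	}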



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 5/8] slab: remove kmem_cache_shrink retval
  2014-05-30 13:51   ` Vladimir Davydov
@ 2014-05-30 14:49     ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-05-30 14:49 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, 30 May 2014, Vladimir Davydov wrote:

> First, nobody uses it. Second, it differs across the implementations:
> for SLUB it always returns 0, for SLAB it returns 0 if the cache appears
> to be empty. So let's get rid of it.

Well, SLUB returns an error code if it fails. I am all in favor of making
it consistent. The indication in SLAB that the cache is empty may be
useful. Maybe return an error code or the number of slab pages in use?

Some of the code that is shared by the allocators here could be moved into
mm/slab_common.c. Put kmem_cache_shrink there and then have
__kmem_cache_shrink do the allocator-specific things?
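
Concretely, the split could look something like this (a sketch only;
the get_online_cpus()/get_online_mems() protection is an assumption
about what the common wrapper would want to take):

/* mm/slab_common.c -- common entry point */
int kmem_cache_shrink(struct kmem_cache *cachep)
{
	int ret;

	get_online_cpus();
	get_online_mems();
	ret = __kmem_cache_shrink(cachep);	/* allocator-specific part */
	put_online_mems();
	put_online_cpus();
	return ret;
}
EXPORT_SYMBOL(kmem_cache_shrink);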

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately
  2014-05-30 13:51   ` Vladimir Davydov
@ 2014-05-30 14:57     ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-05-30 14:57 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, 30 May 2014, Vladimir Davydov wrote:

> (3) is a bit more difficult, because slabs are added to per-cpu partial
> lists lock-less. Fortunately, we only have to handle the __slab_free
> case, because, as there shouldn't be any allocation requests dispatched
> to a dead memcg cache, get_partial_node() should never be called. In
> __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
> put_cpu_partial) so that setting ->partial to a special value, which
> will make put_cpu_partial bail out, will do the trick.
>
> Note, this shouldn't affect performance, because keeping empty slabs on
> per node lists as well as using per cpu partials are only worthwhile if
> the cache is used for allocations, which isn't the case for dead caches.

This all sounds pretty good to me but we still have some pretty extensive
modifications that I would rather avoid.

In put_cpu_partial you can simply check that the memcg is dead, right?
This would avoid all the other modifications, I would think, and would
not require a special value for the per cpu partial pointer.
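
Presumably something along these lines (a sketch of the simpler check;
cancel_frozen is the helper added by the patch, and the existing
lock-less insertion is elided):

static void put_cpu_partial(struct kmem_cache *s, struct page *page)
{
	if (memcg_cache_dead(s)) {
		/* dead cache: hand the slab back to its node list */
		cancel_frozen(s, page);
		return;
	}

	/* ... existing cmpxchg loop adding page to s->cpu_slab->partial ... */
}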

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 8/8] slab: reap dead memcg caches aggressively
  2014-05-30 13:51   ` Vladimir Davydov
@ 2014-05-30 15:01     ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-05-30 15:01 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, 30 May 2014, Vladimir Davydov wrote:

> There is no use in keeping free objects/slabs on dead memcg caches,
> because they will never be allocated. So let's make cache_reap() shrink
> as many free objects from such caches as possible.
>
> Note the difference between SLAB and SLUB handling of dead memcg caches.
> For SLUB, dead cache destruction is scheduled as soon as the last object
> is freed, because dead caches do not cache free objects. For SLAB, dead
> caches can keep some free objects on per cpu arrays, so that an empty
> dead cache will be hanging around until cache_reap() drains it.

Calling kmem_cache_shrink() should drain all caches though. Reduce the
size of the queues to zero or so before calling shrink so that no new
caches are built up?

> We don't disable free objects caching for SLAB, because it would force
> kfree to always take a spin lock, which would degrade performance
> significantly.

You can use a similar approach to the one in SLUB. Reduce the size of the
per cpu array objects to zero. Then SLAB will always fall back to its slow
path in cache_flusharray(), where you may be able to do something with
less of an impact on performance.
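
If I read that right, the idea maps to something like this (a sketch
only; it assumes do_tune_cpucache() can simply be reused for a dead
cache, and the helper name is made up for illustration):

/*
 * Shrink the per cpu arrays of a dead cache so that nearly every kfree
 * falls through to cache_flusharray() and the slow path.
 */
static void slab_deactivate_cpu_cache(struct kmem_cache *cachep)
{
	/* limit = 1, batchcount = 1, shared = 0 */
	do_tune_cpucache(cachep, 1, 1, 0, GFP_KERNEL);
}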

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 4/8] slub: never fail kmem_cache_shrink
  2014-05-30 14:46     ` Christoph Lameter
@ 2014-05-31 10:18       ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-05-31 10:18 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, May 30, 2014 at 09:46:33AM -0500, Christoph Lameter wrote:
> On Fri, 30 May 2014, Vladimir Davydov wrote:
> 
> > SLUB's kmem_cache_shrink not only removes empty slabs from the cache,
> > but also sorts slabs by the number of objects in-use to cope with
> > fragmentation. To achieve that, it tries to allocate a temporary array.
> > If it fails, it will abort the whole procedure.
> 
> If we cannot allocate a kernel structure that is mostly less than a page
> size then we have much more important things to worry about.

That's all fair, but it doesn't explain why we should fail to shrink
unused slabs just because we couldn't do an optional optimization. IMO,
that's behavior one wouldn't expect.

> > This is unacceptable for kmemcg, where we want to be sure that all empty
> > slabs are removed from the cache on memcg offline, so let's just skip
> > the de-fragmentation step if the allocation fails, but still get rid of
> > empty slabs.
> 
> Lets just try the shrink and log the fact that it failed? Try again later?

... which means more async workers, more complication to kmemcg code :-(

Sorry, but I just don't get why we can't make kmem_cache_shrink never
fail. Is failing de-fragmentation, which is not even implied by the
function declaration, so critical that it should be reported? If so, we
can return an error while still shrinking empty slabs...

If you just don't like the code after the patch, here is another, less
intrusive version doing practically the same. Would it be better?

diff --git a/mm/slub.c b/mm/slub.c
index d96faa2464c3..e45af8c4fb7c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3404,12 +3404,15 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	struct page *page;
 	struct page *t;
 	int objects = oo_objects(s->max);
+	struct list_head empty_slabs;
 	struct list_head *slabs_by_inuse =
 		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
 	unsigned long flags;
 
-	if (!slabs_by_inuse)
-		return -ENOMEM;
+	if (!slabs_by_inuse) {
+		slabs_by_inuse = &empty_slabs;
+		objects = 1;
+	}
 
 	flush_all(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
@@ -3430,7 +3433,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			list_move(&page->lru, slabs_by_inuse + page->inuse);
+			if (page->inuse < objects)
+				list_move(&page->lru,
+					  slabs_by_inuse + page->inuse);
 			if (!page->inuse)
 				n->nr_partial--;
 		}

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 5/8] slab: remove kmem_cache_shrink retval
  2014-05-30 14:49     ` Christoph Lameter
@ 2014-05-31 10:27       ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-05-31 10:27 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, May 30, 2014 at 09:49:55AM -0500, Christoph Lameter wrote:
> On Fri, 30 May 2014, Vladimir Davydov wrote:
> 
> > First, nobody uses it. Second, it differs across the implementations:
> > for SLUB it always returns 0, for SLAB it returns 0 if the cache appears
> > to be empty. So let's get rid of it.
> 
> Well slub returns an error code if it fails

... to sort slabs by the number of objects in use, which is not even
implied by the function declaration. Why can *shrinking*, which is what
kmem_cache_shrink must do in the first place, ever fail?

> I am all in favor of making it consistent. The indication in SLAB that
> the slab is empty may be useful. May return error code or the number
> of slab pages in use?

We can, but why bother if nobody is going to use it?

> Some of the code that is shared by the allocators here could be moved into
> mm/slab_common.c. Put kmem_cache_shrink there and then have
> __kmem_cache_shrink to the allocator specific things?

Already did in scope of mm-commit 4fabfe86c4a5 ("slab: get_online_mems
for kmem_cache_{create,destroy,shrink}") :-)

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately
  2014-05-30 14:57     ` Christoph Lameter
@ 2014-05-31 11:04       ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-05-31 11:04 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
> On Fri, 30 May 2014, Vladimir Davydov wrote:
> 
> > (3) is a bit more difficult, because slabs are added to per-cpu partial
> > lists lock-less. Fortunately, we only have to handle the __slab_free
> > case, because, as there shouldn't be any allocation requests dispatched
> > to a dead memcg cache, get_partial_node() should never be called. In
> > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
> > put_cpu_partial) so that setting ->partial to a special value, which
> > will make put_cpu_partial bail out, will do the trick.
> >
> > Note, this shouldn't affect performance, because keeping empty slabs on
> > per node lists as well as using per cpu partials are only worthwhile if
> > the cache is used for allocations, which isn't the case for dead caches.
> 
> This all sounds pretty good to me but we still have some pretty extensive
> modifications that I would rather avoid.
> 
> In put_cpu_partial you can simply check that the memcg is dead right? This
> would avoid all the other modifications I would think and will not require
> a special value for the per cpu partial pointer.

That would be racy. The check whether the memcg is dead and the write to
the per cpu partial ptr wouldn't proceed as one atomic operation. If the
dead flag were set from another thread between these two operations,
put_cpu_partial would add a slab to a per cpu partial list *after* the
cache was zapped.

But are the modifications this patch introduces really that extensive?

In fact, it just adds a check for ->partial == CPU_SLAB_PARTIAL_DEAD in
a couple of places, namely put_cpu_partial and unfreeze_partials, where
it looks pretty natural, IMO. The other hunks of this patch just (1) move
some code, without modifying it, to a separate function, (2) add BUG_ONs
to the alloc paths (get_partial_node and __slab_alloc), where we should
never see this value, and (3) add checks to the sysfs/debug paths.
[ Now I guess I should have split this patch to make it more readable ]

(1) and (2) don't make the code slower or more difficult to understand,
IMO. (3) is a bit cumbersome, but we can make it neater by introducing a
special function for those paths that returns the partial slab if it
wasn't zapped, something like this:

static struct page *cpu_slab_partial(struct kmem_cache *s, int cpu)
{
	/* Return NULL if per cpu partials were disabled (the cache is dead) */
	struct page *page = per_cpu_ptr(s->cpu_slab, cpu)->partial;

	if (page == CPU_SLAB_PARTIAL_DEAD)
		page = NULL;
	return page;
}

Thus we would have to check for this special value in only three places
in the code, namely put_cpu_partial, unfreeze_partials, and
cpu_slab_partial.

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 8/8] slab: reap dead memcg caches aggressively
  2014-05-30 15:01     ` Christoph Lameter
@ 2014-05-31 11:19       ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-05-31 11:19 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Fri, May 30, 2014 at 10:01:26AM -0500, Christoph Lameter wrote:
> On Fri, 30 May 2014, Vladimir Davydov wrote:
> 
> > We don't disable free objects caching for SLAB, because it would force
> > kfree to always take a spin lock, which would degrade performance
> > significantly.
> 
> You can use a similar approach than in SLUB. Reduce the size of the per
> cpu array objects to zero. Then SLAB will always fall back to its slow
> path in cache_flusharray() where you may be able to do something with less
> of an impact on performace.

In contrast to SLUB, for SLAB this will slow down kfree significantly.
The fast path for SLAB just puts an object into a per cpu array, while
the slow path requires taking a per node lock, which is much slower even
without contention. There can still be lots of objects in a dead memcg
cache (e.g. hundreds of megabytes of dcache), so such a performance
degradation is not acceptable, IMO.
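
For reference, the asymmetry in question looks roughly like this (a
simplified sketch of SLAB's free path, not the exact code):

	/* __cache_free(), roughly */
	if (likely(ac->avail < ac->limit)) {
		/* fast path: no lock, just store into the per cpu array */
		ac->entry[ac->avail++] = objp;
	} else {
		/* slow path: takes the per node list_lock */
		cache_flusharray(cachep, ac);
		ac->entry[ac->avail++] = objp;
	}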

OTOH, we already have cache_reap running periodically for each cache.
Making it drain all free objects from dead caches won't impact performance
at all, nor will it complicate the code. The only downside is that a dead
cache won't be destroyed immediately after it becomes unused, but since
cache_reap runs pretty often (every few seconds), it shouldn't cause any
problems, I guess.

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately
  2014-05-31 11:04       ` Vladimir Davydov
@ 2014-06-02  4:24         ` Joonsoo Kim
  -1 siblings, 0 replies; 76+ messages in thread
From: Joonsoo Kim @ 2014-06-02  4:24 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Christoph Lameter, akpm, hannes, mhocko, linux-kernel, linux-mm

On Sat, May 31, 2014 at 03:04:58PM +0400, Vladimir Davydov wrote:
> On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
> > On Fri, 30 May 2014, Vladimir Davydov wrote:
> > 
> > > (3) is a bit more difficult, because slabs are added to per-cpu partial
> > > lists lock-less. Fortunately, we only have to handle the __slab_free
> > > case, because, as there shouldn't be any allocation requests dispatched
> > > to a dead memcg cache, get_partial_node() should never be called. In
> > > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
> > > put_cpu_partial) so that setting ->partial to a special value, which
> > > will make put_cpu_partial bail out, will do the trick.
> > >
> > > Note, this shouldn't affect performance, because keeping empty slabs on
> > > per node lists as well as using per cpu partials are only worthwhile if
> > > the cache is used for allocations, which isn't the case for dead caches.
> > 
> > This all sounds pretty good to me but we still have some pretty extensive
> > modifications that I would rather avoid.
> > 
> > In put_cpu_partial you can simply check that the memcg is dead right? This
> > would avoid all the other modifications I would think and will not require
> > a special value for the per cpu partial pointer.
> 
> That would be racy. The check if memcg is dead and the write to per cpu
> partial ptr wouldn't proceed as one atomic operation. If we set the dead
> flag from another thread between these two operations, put_cpu_partial
> will add a slab to a per cpu partial list *after* the cache was zapped.

Hello, Vladimir.

I think that we can do (3) easily.
If we check memcg_cache_dead() at the end of put_cpu_partial() rather
than at the beginning of put_cpu_partial(), we can avoid the race you
mentioned. If someone does put_cpu_partial() before the dead flag is set,
the slab can be zapped by whoever sets the dead flag. And if someone does
put_cpu_partial() after the dead flag is set, it can be zapped by whoever
does put_cpu_partial().
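
Roughly, the ordering being proposed (a sketch only; the existing
lock-less insertion and the irq handling are simplified):

static void put_cpu_partial(struct kmem_cache *s, struct page *page)
{
	/* ... existing cmpxchg loop adding page to s->cpu_slab->partial ... */

	if (memcg_cache_dead(s)) {
		unsigned long flags;

		/* whoever sees the flag here flushes what was just added */
		local_irq_save(flags);
		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
		local_irq_restore(flags);
	}
}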

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 8/8] slab: reap dead memcg caches aggressively
  2014-05-30 13:51   ` Vladimir Davydov
@ 2014-06-02  4:41     ` Joonsoo Kim
  -1 siblings, 0 replies; 76+ messages in thread
From: Joonsoo Kim @ 2014-06-02  4:41 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, cl, hannes, mhocko, linux-kernel, linux-mm

On Fri, May 30, 2014 at 05:51:11PM +0400, Vladimir Davydov wrote:
> There is no use in keeping free objects/slabs on dead memcg caches,
> because they will never be allocated. So let's make cache_reap() shrink
> as many free objects from such caches as possible.
> 
> Note the difference between SLAB and SLUB handling of dead memcg caches.
> For SLUB, dead cache destruction is scheduled as soon as the last object
> is freed, because dead caches do not cache free objects. For SLAB, dead
> caches can keep some free objects on per cpu arrays, so that an empty
> dead cache will be hanging around until cache_reap() drains it.
> 
> We don't disable free objects caching for SLAB, because it would force
> kfree to always take a spin lock, which would degrade performance
> significantly.
> 
> Since cache_reap() drains all caches once ~4 secs on each CPU, empty
> dead caches will die quickly.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
>  mm/slab.c |   17 ++++++++++++-----
>  1 file changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index cecc01bba389..d81e46316c99 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -3985,6 +3985,11 @@ static void cache_reap(struct work_struct *w)
>  		goto out;
>  
>  	list_for_each_entry(searchp, &slab_caches, list) {
> +		int force = 0;
> +
> +		if (memcg_cache_dead(searchp))
> +			force = 1;
> +
>  		check_irq_on();
>  
>  		/*
> @@ -3996,7 +4001,7 @@ static void cache_reap(struct work_struct *w)
>  
>  		reap_alien(searchp, n);
>  
> -		drain_array(searchp, n, cpu_cache_get(searchp), 0, node);
> +		drain_array(searchp, n, cpu_cache_get(searchp), force, node);
>  
>  		/*
>  		 * These are racy checks but it does not matter
> @@ -4007,15 +4012,17 @@ static void cache_reap(struct work_struct *w)
>  
>  		n->next_reap = jiffies + REAPTIMEOUT_NODE;
>  
> -		drain_array(searchp, n, n->shared, 0, node);
> +		drain_array(searchp, n, n->shared, force, node);
>  
>  		if (n->free_touched)
>  			n->free_touched = 0;
>  		else {
> -			int freed;
> +			int freed, tofree;
> +
> +			tofree = force ? slabs_tofree(searchp, n) :
> +				DIV_ROUND_UP(n->free_limit, 5 * searchp->num);

Hello,

According to my reading of the code, slabs_tofree() doesn't return the
number of free slabs. This bug was introduced by commit 0fa8103b. I think
it is better to fix it before applying this patch. Otherwise, use
n->free_objects instead of slabs_tofree() to achieve your purpose
correctly.

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately
  2014-06-02  4:24         ` Joonsoo Kim
@ 2014-06-02 11:47           ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-06-02 11:47 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Christoph Lameter, akpm, hannes, mhocko, linux-kernel, linux-mm

Hi Joonsoo,

On Mon, Jun 02, 2014 at 01:24:36PM +0900, Joonsoo Kim wrote:
> On Sat, May 31, 2014 at 03:04:58PM +0400, Vladimir Davydov wrote:
> > On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
> > > On Fri, 30 May 2014, Vladimir Davydov wrote:
> > > 
> > > > (3) is a bit more difficult, because slabs are added to per-cpu partial
> > > > lists lock-less. Fortunately, we only have to handle the __slab_free
> > > > case, because, as there shouldn't be any allocation requests dispatched
> > > > to a dead memcg cache, get_partial_node() should never be called. In
> > > > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
> > > > put_cpu_partial) so that setting ->partial to a special value, which
> > > > will make put_cpu_partial bail out, will do the trick.
[...]
> I think that we can do (3) easily.
> If we check memcg_cache_dead() in the end of put_cpu_partial() rather
> than in the begin of put_cpu_partial(), we can avoid the race you 
> mentioned. If someone do put_cpu_partial() before dead flag is set,
> it can be zapped by who set dead flag. And if someone do
> put_cpu_partial() after dead flag is set, it can be zapped by who
> do put_cpu_partial().

After put_cpu_partial() adds a frozen slab to a per cpu partial list,
the slab becomes visible to other threads, which means it can be
unfrozen and freed. The latter can trigger cache destruction. Hence we
shouldn't touch the cache, in particular call memcg_cache_dead() on it,
after calling put_cpu_partial(), otherwise we can get use-after-free.

However, what you propose makes sense if we disable irqs before adding a
slab to a partial list and enable them only after checking if the cache
is dead and unfreezing all partials if so, i.e.

diff --git a/mm/slub.c b/mm/slub.c
index d96faa2464c3..14b9e9a8677c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2030,8 +2030,15 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 	struct page *oldpage;
 	int pages;
 	int pobjects;
+	unsigned long flags;
+	int irq_saved = 0;
 
 	do {
+		if (irq_saved) {
+			local_irq_restore(flags);
+			irq_saved = 0;
+		}
+
 		pages = 0;
 		pobjects = 0;
 		oldpage = this_cpu_read(s->cpu_slab->partial);
@@ -2062,8 +2069,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 		page->pobjects = pobjects;
 		page->next = oldpage;
 
+		local_irq_save(flags);
+		irq_saved = 1;
+
 	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
 								!= oldpage);
+
+	if (memcg_cache_dead(s))
+		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+
+	local_irq_restore(flags);
 #endif
 }
 

That would be safe against possible cache destruction, because to remove
a slab from a per cpu partial list we have to run on the cpu it was
frozen on. Disabling irqs makes it impossible.

Christoph,

Does it look better to you? BTW, why can't we *always* disable irqs for
the whole put_cpu_partial()? That way handling dead caches there would
be trivial, and we wouldn't have to use this_cpu_cmpxchg().

Thanks.

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 8/8] slab: reap dead memcg caches aggressively
  2014-06-02  4:41     ` Joonsoo Kim
@ 2014-06-02 12:10       ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-06-02 12:10 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: akpm, cl, hannes, mhocko, linux-kernel, linux-mm

On Mon, Jun 02, 2014 at 01:41:55PM +0900, Joonsoo Kim wrote:
> According to my code reading, slabs_to_free() doesn't return number of
> free slabs. This bug is introduced by 0fa8103b. I think that it is
> better to fix it before applyting this patch. Otherwise, use n->free_objects
> instead of slabs_tofree() to achieve your purpose correctly.

Hmm, I don't think slabs_tofree() computes the number of free slabs
incorrectly. If we have N free objects, there can be at most
DIV_ROUND_UP(N, objs_per_slab) empty slabs, and that's exactly what
slabs_tofree() computes, no?

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 8/8] slab: reap dead memcg caches aggressively
  2014-06-02 12:10       ` Vladimir Davydov
@ 2014-06-02 14:01         ` Joonsoo Kim
  -1 siblings, 0 replies; 76+ messages in thread
From: Joonsoo Kim @ 2014-06-02 14:01 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Joonsoo Kim, Andrew Morton, Christoph Lameter, Johannes Weiner,
	Michal Hocko, LKML, Linux Memory Management List

2014-06-02 21:10 GMT+09:00 Vladimir Davydov <vdavydov@parallels.com>:
> On Mon, Jun 02, 2014 at 01:41:55PM +0900, Joonsoo Kim wrote:
>> According to my code reading, slabs_to_free() doesn't return number of
>> free slabs. This bug is introduced by 0fa8103b. I think that it is
>> better to fix it before applyting this patch. Otherwise, use n->free_objects
>> instead of slabs_tofree() to achieve your purpose correctly.
>
> Hmm, I don't think slab_tofree() computes the number of free slabs
> wrong. If we have N free objects, there may be
> DIV_ROUND_UP(N,objs_per_slab) empty slabs at max, and that's exactly
> what slab_tofree() does, no?

Oops... Sorry for the wrong comment.
You are right. Please ignore it. :)

BTW, we don't need DIV_ROUND_UP. I think just N / objs_per_slab is
sufficient to get the maximum number of empty slabs. Am I missing
something again?

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately
  2014-06-02 11:47           ` Vladimir Davydov
@ 2014-06-02 14:03             ` Joonsoo Kim
  -1 siblings, 0 replies; 76+ messages in thread
From: Joonsoo Kim @ 2014-06-02 14:03 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Joonsoo Kim, Christoph Lameter, Andrew Morton, Johannes Weiner,
	Michal Hocko, LKML, Linux Memory Management List

2014-06-02 20:47 GMT+09:00 Vladimir Davydov <vdavydov@parallels.com>:
> Hi Joonsoo,
>
> On Mon, Jun 02, 2014 at 01:24:36PM +0900, Joonsoo Kim wrote:
>> On Sat, May 31, 2014 at 03:04:58PM +0400, Vladimir Davydov wrote:
>> > On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
>> > > On Fri, 30 May 2014, Vladimir Davydov wrote:
>> > >
>> > > > (3) is a bit more difficult, because slabs are added to per-cpu partial
>> > > > lists lock-less. Fortunately, we only have to handle the __slab_free
>> > > > case, because, as there shouldn't be any allocation requests dispatched
>> > > > to a dead memcg cache, get_partial_node() should never be called. In
>> > > > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
>> > > > put_cpu_partial) so that setting ->partial to a special value, which
>> > > > will make put_cpu_partial bail out, will do the trick.
> [...]
>> I think that we can do (3) easily.
>> If we check memcg_cache_dead() in the end of put_cpu_partial() rather
>> than in the begin of put_cpu_partial(), we can avoid the race you
>> mentioned. If someone do put_cpu_partial() before dead flag is set,
>> it can be zapped by who set dead flag. And if someone do
>> put_cpu_partial() after dead flag is set, it can be zapped by who
>> do put_cpu_partial().
>
> After put_cpu_partial() adds a frozen slab to a per cpu partial list,
> the slab becomes visible to other threads, which means it can be
> unfrozen and freed. The latter can trigger cache destruction. Hence we
> shouldn't touch the cache, in particular call memcg_cache_dead() on it,
> after calling put_cpu_partial(), otherwise we can get use-after-free.
>
> However, what you propose makes sense if we disable irqs before adding a
> slab to a partial list and enable them only after checking if the cache
> is dead and unfreezing all partials if so, i.e.
>
> diff --git a/mm/slub.c b/mm/slub.c
> index d96faa2464c3..14b9e9a8677c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2030,8 +2030,15 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>         struct page *oldpage;
>         int pages;
>         int pobjects;
> +       unsigned long flags;
> +       int irq_saved = 0;
>
>         do {
> +               if (irq_saved) {
> +                       local_irq_restore(flags);
> +                       irq_saved = 0;
> +               }
> +
>                 pages = 0;
>                 pobjects = 0;
>                 oldpage = this_cpu_read(s->cpu_slab->partial);
> @@ -2062,8 +2069,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>                 page->pobjects = pobjects;
>                 page->next = oldpage;
>
> +               local_irq_save(flags);
> +               irq_saved = 1;
> +
>         } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
>                                                                 != oldpage);
> +
> +       if (memcg_cache_dead(s))
> +               unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> +
> +       local_irq_restore(flags);
>  #endif
>  }
>
>
> That would be safe against possible cache destruction, because to remove
> a slab from a per cpu partial list we have to run on the cpu it was
> frozen on. Disabling irqs makes it impossible.

Hmm... this is also a bit ugly.
How about the following change?

Thanks.

diff --git a/mm/slub.c b/mm/slub.c
index 2b1ce69..6adab87 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2058,6 +2058,21 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)

        } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
                                                                != oldpage);
+
+       if (memcg_cache_dead(s)) {
+               bool done = false;
+               unsigned long flags;
+
+               local_irq_save(flags);
+               if (this_cpu_read(s->cpu_slab->partial) == page) {
+                       done = true;
+                       unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+               }
+               local_irq_restore(flags);
+
+               if (!done)
+                       flush_all(s);
+       }
 #endif
 }

^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 4/8] slub: never fail kmem_cache_shrink
  2014-05-31 10:18       ` Vladimir Davydov
@ 2014-06-02 15:13         ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-06-02 15:13 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Sat, 31 May 2014, Vladimir Davydov wrote:

> ... which means more async workers, more complication to kmemcg code :-(
>
> Sorry, but I just don't get why we can't make kmem_cache_shrink never
> fail? Is failing de-fragmentation, which is even not implied by the
> function declaration, so critical that should be noted? If so, we can
> return an error while still shrinking empty slabs...

There could be other reasons for failure in the future as
kmem_cache_shrink is updated. Requiring kmem_cache_shrink to never fail
may cause problems for future modifications.

> If you just don't like the code after the patch, here is another, less
> intrusive version doing practically the same. Would it be better?

That looks acceptable.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 5/8] slab: remove kmem_cache_shrink retval
  2014-05-31 10:27       ` Vladimir Davydov
@ 2014-06-02 15:16         ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-06-02 15:16 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Sat, 31 May 2014, Vladimir Davydov wrote:

> > Well slub returns an error code if it fails
>
> ... to sort slabs by the nubmer of objects in use, which is not even
> implied by the function declaration. Why can *shrinking*, which is what
> kmem_cache_shrink must do at first place, ever fail?

Because there can be a memory allocation failure. Or there may be other
processes going on that prevent shrinking. For example, we may want to
merge a patchset that does defragmentation of slabs at some point.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately
  2014-06-02 14:03             ` Joonsoo Kim
@ 2014-06-02 15:17               ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-06-02 15:17 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vladimir Davydov, Joonsoo Kim, Andrew Morton, Johannes Weiner,
	Michal Hocko, LKML, Linux Memory Management List

On Mon, 2 Jun 2014, Joonsoo Kim wrote:

> Hmm... this is also a bit ugly.
> How about following change?

That looks much cleaner.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 8/8] slab: reap dead memcg caches aggressively
  2014-05-31 11:19       ` Vladimir Davydov
@ 2014-06-02 15:24         ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-06-02 15:24 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Sat, 31 May 2014, Vladimir Davydov wrote:

> > You can use a similar approach than in SLUB. Reduce the size of the per
> > cpu array objects to zero. Then SLAB will always fall back to its slow
> > path in cache_flusharray() where you may be able to do something with less
> > of an impact on performace.
>
> In contrast to SLUB, for SLAB this will slow down kfree significantly.

But that is only when you want to destroy a cache. This is similar.

> Fast path for SLAB is just putting an object to a per cpu array, while
> the slow path requires taking a per node lock, which is much slower even
> with no contention. There still can be lots of objects in a dead memcg
> cache (e.g. hundreds of megabytes of dcache), so such performance
> degradation is not acceptable, IMO.

I am not sure that there is such a stark difference to SLUB. SLUB also
takes the per node lock if necessary to handle freeing especially if you
zap the per cpu partial slab pages.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately
  2014-06-02 14:03             ` Joonsoo Kim
@ 2014-06-03  8:16               ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-06-03  8:16 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Christoph Lameter, Andrew Morton, Johannes Weiner,
	Michal Hocko, LKML, Linux Memory Management List

On Mon, Jun 02, 2014 at 11:03:51PM +0900, Joonsoo Kim wrote:
> 2014-06-02 20:47 GMT+09:00 Vladimir Davydov <vdavydov@parallels.com>:
> > Hi Joonsoo,
> >
> > On Mon, Jun 02, 2014 at 01:24:36PM +0900, Joonsoo Kim wrote:
> >> On Sat, May 31, 2014 at 03:04:58PM +0400, Vladimir Davydov wrote:
> >> > On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
> >> > > On Fri, 30 May 2014, Vladimir Davydov wrote:
> >> > >
> >> > > > (3) is a bit more difficult, because slabs are added to per-cpu partial
> >> > > > lists lock-less. Fortunately, we only have to handle the __slab_free
> >> > > > case, because, as there shouldn't be any allocation requests dispatched
> >> > > > to a dead memcg cache, get_partial_node() should never be called. In
> >> > > > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
> >> > > > put_cpu_partial) so that setting ->partial to a special value, which
> >> > > > will make put_cpu_partial bail out, will do the trick.
> > [...]
> >> I think that we can do (3) easily.
> >> If we check memcg_cache_dead() in the end of put_cpu_partial() rather
> >> than in the begin of put_cpu_partial(), we can avoid the race you
> >> mentioned. If someone do put_cpu_partial() before dead flag is set,
> >> it can be zapped by who set dead flag. And if someone do
> >> put_cpu_partial() after dead flag is set, it can be zapped by who
> >> do put_cpu_partial().
> >
> > After put_cpu_partial() adds a frozen slab to a per cpu partial list,
> > the slab becomes visible to other threads, which means it can be
> > unfrozen and freed. The latter can trigger cache destruction. Hence we
> > shouldn't touch the cache, in particular call memcg_cache_dead() on it,
> > after calling put_cpu_partial(), otherwise we can get use-after-free.
> >
> > However, what you propose makes sense if we disable irqs before adding a
> > slab to a partial list and enable them only after checking if the cache
> > is dead and unfreezing all partials if so, i.e.
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index d96faa2464c3..14b9e9a8677c 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2030,8 +2030,15 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> >         struct page *oldpage;
> >         int pages;
> >         int pobjects;
> > +       unsigned long flags;
> > +       int irq_saved = 0;
> >
> >         do {
> > +               if (irq_saved) {
> > +                       local_irq_restore(flags);
> > +                       irq_saved = 0;
> > +               }
> > +
> >                 pages = 0;
> >                 pobjects = 0;
> >                 oldpage = this_cpu_read(s->cpu_slab->partial);
> > @@ -2062,8 +2069,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> >                 page->pobjects = pobjects;
> >                 page->next = oldpage;
> >
> > +               local_irq_save(flags);
> > +               irq_saved = 1;
> > +
> >         } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
> >                                                                 != oldpage);
> > +
> > +       if (memcg_cache_dead(s))
> > +               unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> > +
> > +       local_irq_restore(flags);
> >  #endif
> >  }
> >
> >
> > That would be safe against possible cache destruction, because to remove
> > a slab from a per cpu partial list we have to run on the cpu it was
> > frozen on. Disabling irqs makes it impossible.
> 
> Hmm... this is also a bit ugly.
> How about following change?
> 
> Thanks.
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 2b1ce69..6adab87 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2058,6 +2058,21 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
> 
>         } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
>                                                                 != oldpage);
> +
> +       if (memcg_cache_dead(s)) {
> +               bool done = false;
> +               unsigned long flags;

Suppose we are preempted here. In the meantime all objects are freed to
the cache, all frozen pages are unfrozen and also freed. The cache
destruction is then scheduled (patch 2 of this set). Then, when this
thread continues execution, it will operate on a cache that has already
been destroyed - a use-after-free.
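
To spell out the window (a hypothetical interleaving):

  this CPU                                 elsewhere
  --------                                 ---------
  put_cpu_partial()
    this_cpu_cmpxchg() succeeds
    <preempted before memcg_cache_dead()>
                                           remaining objects are freed,
                                           all partials are unfrozen and
                                           freed, the last page goes away,
                                           cache destruction work runs,
                                           the kmem_cache itself is freed
    memcg_cache_dead(s)  <- 's' is already gone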

I admit, this is very unlikely, but can we ignore this possibility?

Thanks.

> +
> +               local_irq_save(flags);
> +               if (this_cpu_read(s->cpu_slab->partial) == page) {
> +                       done = true;
> +                       unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> +               }
> +               local_irq_restore(flags);
> +
> +               if (!done)
> +                       flush_all(s);
> +       }
>  #endif
>  }

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 8/8] slab: reap dead memcg caches aggressively
  2014-06-02 14:01         ` Joonsoo Kim
@ 2014-06-03  8:21           ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-06-03  8:21 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Christoph Lameter, Johannes Weiner,
	Michal Hocko, LKML, Linux Memory Management List

On Mon, Jun 02, 2014 at 11:01:55PM +0900, Joonsoo Kim wrote:
> 2014-06-02 21:10 GMT+09:00 Vladimir Davydov <vdavydov@parallels.com>:
> > On Mon, Jun 02, 2014 at 01:41:55PM +0900, Joonsoo Kim wrote:
> >> According to my code reading, slabs_to_free() doesn't return number of
> >> free slabs. This bug is introduced by 0fa8103b. I think that it is
> >> better to fix it before applyting this patch. Otherwise, use n->free_objects
> >> instead of slabs_tofree() to achieve your purpose correctly.
> >
> > Hmm, I don't think slab_tofree() computes the number of free slabs
> > wrong. If we have N free objects, there may be
> > DIV_ROUND_UP(N,objs_per_slab) empty slabs at max, and that's exactly
> > what slab_tofree() does, no?
> 
[...]
> BTW, we don't need DIV_ROUND_UP. I think that just N / objs_per_slab is
> sufficient to get number of empty slabs at max. Am I missing too?

Yeah, you're right - DIV_ROUND_UP is obviously redundant, DIV would be
enough. Not a bug though.
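
For example, with objs_per_slab = 4 and N = 10 free objects there can be
at most 10 / 4 = 2 completely empty slabs (a third one would require 12
free objects), while DIV_ROUND_UP(10, 4) = 3 merely overestimates the
limit, which is harmless here.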

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 5/8] slab: remove kmem_cache_shrink retval
  2014-06-02 15:16         ` Christoph Lameter
@ 2014-06-03  9:06           ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-06-03  9:06 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Mon, Jun 02, 2014 at 10:16:03AM -0500, Christoph Lameter wrote:
> On Sat, 31 May 2014, Vladimir Davydov wrote:
> 
> > > Well slub returns an error code if it fails
> >
> > ... to sort slabs by the nubmer of objects in use, which is not even
> > implied by the function declaration. Why can *shrinking*, which is what
> > kmem_cache_shrink must do at first place, ever fail?
> 
> Because there is a memory allocation failure. Or there may be other
> processes going on that prevent shrinking. F.e. We may want to merge a
> patchset that does defragmentation of slabs at some point.

Fair enough.

Still, I really want to be sure that all empty slabs are evicted from
the cache on memcg offline. Handling failures there means introducing a
worker that will retry shrinking, but that seems to me an unnecessary
complication, because there's nothing that can prevent us from shrinking
empty slabs from the cache, even if we merge slab defragmentation, is
there?

Maybe it's worth introducing a special function, say kmem_cache_zap(),
that will only evict empty slabs from the cache and disable caching of
empty slabs from then on? This function would be called only from memcg
offline, for dead memcg caches.
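
Interface-wise, I mean something like this (just a sketch to illustrate
the intent; the name and exact semantics are of course up for
discussion):

/*
 * Evict all empty slabs from the cache and stop caching empty slabs
 * from now on, so that any slab that becomes empty later is released
 * immediately.  Unlike kmem_cache_shrink(), this is not allowed to
 * fail, because it doesn't try to defragment or allocate anything.
 */
void kmem_cache_zap(struct kmem_cache *cachep);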

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 5/8] slab: remove kmem_cache_shrink retval
  2014-06-03  9:06           ` Vladimir Davydov
@ 2014-06-03 14:48             ` Christoph Lameter
  -1 siblings, 0 replies; 76+ messages in thread
From: Christoph Lameter @ 2014-06-03 14:48 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Tue, 3 Jun 2014, Vladimir Davydov wrote:

> Still, I really want to evict all empty slabs from cache on memcg
> offline for sure. Handling failures there means introducing a worker
> that will retry shrinking, but that seems to me as an unnecessary
> complication, because there's nothing that can prevent us from shrinking
> empty slabs from the cache, even if we merge slab defragmentation, isn't
> it?
>
> May be, it's worth introducing a special function, say kmem_cache_zap(),
> that will only evict empty slabs from the cache, plus disable empty
> slabs caching? This function would be called only from memcg offline for
> dead memcg caches.

I am fine with the lower impact version that you came up with later.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 5/8] slab: remove kmem_cache_shrink retval
  2014-06-03 14:48             ` Christoph Lameter
@ 2014-06-03 19:00               ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-06-03 19:00 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Tue, Jun 03, 2014 at 09:48:51AM -0500, Christoph Lameter wrote:
> On Tue, 3 Jun 2014, Vladimir Davydov wrote:
> 
> > Still, I really want to evict all empty slabs from the cache on memcg
> > offline for sure. Handling failures there means introducing a worker
> > that will retry shrinking, but that seems to me an unnecessary
> > complication, because there's nothing that can prevent us from evicting
> > empty slabs from the cache, even if we merge slab defragmentation, is
> > there?
> >
> > Maybe it's worth introducing a special function, say kmem_cache_zap(),
> > that would only evict empty slabs from the cache and disable caching of
> > empty slabs? This function would be called only from memcg offline for
> > dead memcg caches.
> 
> I am fine with the lower impact version that you came up with later.

Oh, I missed a couple of your previous e-mails, because our mail server
marked them (along with a hundred or so other messages :-) ) as junk
for some reason. Going to turn off the filter completely.

Sorry for being inconsistent and thank you.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 8/8] slab: reap dead memcg caches aggressively
  2014-06-02 15:24         ` Christoph Lameter
@ 2014-06-03 20:18           ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-06-03 20:18 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: akpm, hannes, mhocko, linux-kernel, linux-mm

On Mon, Jun 02, 2014 at 10:24:09AM -0500, Christoph Lameter wrote:
> On Sat, 31 May 2014, Vladimir Davydov wrote:
> 
> > > You can use a similar approach to the one in SLUB. Reduce the size of the per
> > > cpu array objects to zero. Then SLAB will always fall back to its slow
> > > path in cache_flusharray() where you may be able to do something with less
> > > of an impact on performance.
> >
> > In contrast to SLUB, for SLAB this will slow down kfree significantly.
> 
> But that is only when you want to destroy a cache. This is similar.

When we want to destroy a memcg cache, there can be a really large number
of objects allocated from it, e.g. gigabytes of inodes and dentries. That's
why I think we should avoid any performance degradation if possible.

> 
> > The fast path for SLAB is just putting an object into a per-cpu array, while
> > the slow path requires taking a per node lock, which is much slower even
> > with no contention. There can still be lots of objects in a dead memcg
> > cache (e.g. hundreds of megabytes of dcache), so such performance
> > degradation is not acceptable, IMO.
> 
> I am not sure that there is such a stark difference from SLUB. SLUB also
> takes the per node lock if necessary to handle freeing, especially if you
> zap the per cpu partial slab pages.

Hmm, for SLUB we will only take the node lock when inserting a slab on
the partial list, while for SLAB disabling per-cpu arrays will result in
taking the lock on each object free. So if there are only a few objects
per slab, the difference won't be huge; otherwise the slowdown will be
noticeable for SLAB, but not for SLUB.

I'm not that sure that we should prefer one way over the other though. I
just think that since we already have periodic reaping for SLAB, why not
employ it for reaping dead memcg caches too, provided it doesn't obfuscate
the code? Anyway, if you think that we can neglect the possible performance
degradation that would result from disabling per-cpu caches for SLAB, I can
give it a try.
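
To be concrete, the kind of cache_reap() tweak I mean would look roughly
like this (a sketch, not the actual patch; 'searchp' and 'n' stand for the
cache and the kmem_cache_node being reaped, memcg_cache_dead() is the flag
added earlier in this set, and drain_freelist() is SLAB's existing helper
for dropping empty slabs):

	if (memcg_cache_dead(searchp)) {
		/* Dead cache: drop every free slab we can find. */
		drain_freelist(searchp, n, n->free_objects);
	} else {
		/* Live cache: keep the usual gentle reaping. */
		drain_freelist(searchp, n, (n->free_limit +
				5 * searchp->num - 1) / (5 * searchp->num));
	}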

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately
  2014-06-03  8:16               ` Vladimir Davydov
@ 2014-06-04  8:53                 ` Joonsoo Kim
  -1 siblings, 0 replies; 76+ messages in thread
From: Joonsoo Kim @ 2014-06-04  8:53 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Joonsoo Kim, Christoph Lameter, Andrew Morton, Johannes Weiner,
	Michal Hocko, LKML, Linux Memory Management List

2014-06-03 17:16 GMT+09:00 Vladimir Davydov <vdavydov@parallels.com>:
> On Mon, Jun 02, 2014 at 11:03:51PM +0900, Joonsoo Kim wrote:
>> 2014-06-02 20:47 GMT+09:00 Vladimir Davydov <vdavydov@parallels.com>:
>> > Hi Joonsoo,
>> >
>> > On Mon, Jun 02, 2014 at 01:24:36PM +0900, Joonsoo Kim wrote:
>> >> On Sat, May 31, 2014 at 03:04:58PM +0400, Vladimir Davydov wrote:
>> >> > On Fri, May 30, 2014 at 09:57:10AM -0500, Christoph Lameter wrote:
>> >> > > On Fri, 30 May 2014, Vladimir Davydov wrote:
>> >> > >
>> >> > > > (3) is a bit more difficult, because slabs are added to per-cpu partial
>> >> > > > lists lock-less. Fortunately, we only have to handle the __slab_free
>> >> > > > case, because, as there shouldn't be any allocation requests dispatched
>> >> > > > to a dead memcg cache, get_partial_node() should never be called. In
>> >> > > > __slab_free we use cmpxchg to modify kmem_cache_cpu->partial (see
>> >> > > > put_cpu_partial) so that setting ->partial to a special value, which
>> >> > > > will make put_cpu_partial bail out, will do the trick.
>> > [...]
>> >> I think that we can do (3) easily.
>> >> If we check memcg_cache_dead() at the end of put_cpu_partial() rather
>> >> than at the beginning of put_cpu_partial(), we can avoid the race you
>> >> mentioned. If someone does put_cpu_partial() before the dead flag is set,
>> >> the slab can be zapped by whoever sets the dead flag. And if someone does
>> >> put_cpu_partial() after the dead flag is set, it can be zapped by whoever
>> >> does put_cpu_partial().
>> >
>> > After put_cpu_partial() adds a frozen slab to a per cpu partial list,
>> > the slab becomes visible to other threads, which means it can be
>> > unfrozen and freed. The latter can trigger cache destruction. Hence we
>> > shouldn't touch the cache, in particular call memcg_cache_dead() on it,
>> > after calling put_cpu_partial(), otherwise we can get use-after-free.
>> >
>> > However, what you propose makes sense if we disable irqs before adding a
>> > slab to a partial list and enable them only after checking if the cache
>> > is dead and unfreezing all partials if so, i.e.
>> >
>> > diff --git a/mm/slub.c b/mm/slub.c
>> > index d96faa2464c3..14b9e9a8677c 100644
>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -2030,8 +2030,15 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>> >         struct page *oldpage;
>> >         int pages;
>> >         int pobjects;
>> > +       unsigned long flags;
>> > +       int irq_saved = 0;
>> >
>> >         do {
>> > +               if (irq_saved) {
>> > +                       local_irq_restore(flags);
>> > +                       irq_saved = 0;
>> > +               }
>> > +
>> >                 pages = 0;
>> >                 pobjects = 0;
>> >                 oldpage = this_cpu_read(s->cpu_slab->partial);
>> > @@ -2062,8 +2069,16 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>> >                 page->pobjects = pobjects;
>> >                 page->next = oldpage;
>> >
>> > +               local_irq_save(flags);
>> > +               irq_saved = 1;
>> > +
>> >         } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
>> >                                                                 != oldpage);
>> > +
>> > +       if (memcg_cache_dead(s))
>> > +               unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
>> > +
>> > +       local_irq_restore(flags);
>> >  #endif
>> >  }
>> >
>> >
>> > That would be safe against possible cache destruction, because to remove
>> > a slab from a per cpu partial list we have to run on the cpu it was
>> > frozen on. Disabling irqs makes it impossible.
>>
>> Hmm... this is also a bit ugly.
>> How about the following change?
>>
>> Thanks.
>>
>> diff --git a/mm/slub.c b/mm/slub.c
>> index 2b1ce69..6adab87 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -2058,6 +2058,21 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>>
>>         } while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
>>                                                                 != oldpage);
>> +
>> +       if (memcg_cache_dead(s)) {
>> +               bool done = false;
>> +               unsigned long flags;
>
> Suppose we are preempted here. In the meantime all objects are freed to
> the cache, all frozen pages are unfrozen and also freed. The cache
> destruction is then scheduled (patch 2 of this set). Then, when this
> thread continues execution, it will operate on a cache that has been
> destroyed: a use-after-free.
>
> I admit, this is very unlikely, but can we ignore this possibility?
>

Hello,

From your comment, I now realize that your cache destruction solution
has a severe problem.

With your solution, the kmem_cache can be destroyed before the last
kfree() caller has returned. That means we can't safely do anything
related to the kmem_cache after losing control of the slab we are freeing
an object to in the free path.

Consider __slab_free(). After put_cpu_partial() is called in __slab_free(),
we attempt to update the stats. There is a possibility that this operation
could be a use-after-free with your solution. For now it is just the stat
update, but it could become more. I don't like to impose this constraint on
the slab free path. So IMHO, it is better to defer destroying the kmem_cache
until the last kfree() caller returns. Is that fair enough? :)
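
For reference, the ordering I mean is roughly this (simplified from my
reading of the SLUB free path, so take it as illustrative only):

	if (new.frozen && !was_frozen) {
		/*
		 * The slab becomes visible on a per-cpu partial list and may
		 * be unfrozen and freed by another thread from here on...
		 */
		put_cpu_partial(s, page, 1);
		/* ...but we still dereference 's' to bump the statistic. */
		stat(s, CPU_PARTIAL_FREE);
	}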

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately
  2014-06-04  8:53                 ` Joonsoo Kim
@ 2014-06-04  9:47                   ` Vladimir Davydov
  -1 siblings, 0 replies; 76+ messages in thread
From: Vladimir Davydov @ 2014-06-04  9:47 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Christoph Lameter, Andrew Morton, Johannes Weiner,
	Michal Hocko, LKML, Linux Memory Management List

On Wed, Jun 04, 2014 at 05:53:29PM +0900, Joonsoo Kim wrote:
> Consider __slab_free(). After put_cpu_partial() is called in __slab_free(),
> we attempt to update the stats. There is a possibility that this operation
> could be a use-after-free with your solution. For now it is just the stat
> update, but it could become more. I don't like to impose this constraint on
> the slab free path.

We can move the stats update before the object free, I guess, but I admit
this is not going to be a flexible solution, because every future
modification to slab_free would then have to be done with great care,
otherwise it may break things.
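
I.e. something like this (sketch only):

	if (new.frozen && !was_frozen) {
		/* Bump the statistic while 's' is still known to be alive... */
		stat(s, CPU_PARTIAL_FREE);
		/* ...then publish the slab; 's' is not touched after this. */
		put_cpu_partial(s, page, 1);
	}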

> So IMHO, it is better to defer destroying the kmem_cache until the last
> kfree() caller returns. Is that fair enough? :)

Actually, I was thinking about it (I even discussed it with Christoph), but
the problem is that there is currently no way to wait for all currently
executing kfree's to complete, because SLUB's version can be preempted
at any time.

One way to solve this is to make slab_free non-preemptable and call
synchronize_sched before kmem_cache_destroy (or use call_rcu_sched).
When I started to implement this approach I found the resulting code a
bit ugly. Also, Christoph had some concerns about it (see
https://lkml.org/lkml/2014/5/23/524). That's why I tried to go with this
patch set first, but that doesn't mean that I'm 100% sure about it :-) I'll
send an implementation of the other approach (with preempt_disable)
soon.
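
For the record, the idea would be roughly the following (just a sketch;
do_slab_free() is a made-up helper standing in for the real free fast path,
and the actual teardown is omitted):

static __always_inline void slab_free(struct kmem_cache *s,
				      struct page *page, void *x,
				      unsigned long addr)
{
	preempt_disable();
	do_slab_free(s, page, x, addr);
	preempt_enable();
}

void kmem_cache_destroy(struct kmem_cache *s)
{
	/*
	 * Frees run with preemption disabled, so once synchronize_sched()
	 * returns, no free path can still be touching 's' and it is safe
	 * to tear the cache down.
	 */
	synchronize_sched();
	/* ... actual destruction ... */
}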

Thanks.

^ permalink raw reply	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2014-06-04  9:47 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-30 13:51 [PATCH -mm 0/8] memcg/slab: reintroduce dead cache self-destruction Vladimir Davydov
2014-05-30 13:51 ` Vladimir Davydov
2014-05-30 13:51 ` [PATCH -mm 1/8] memcg: cleanup memcg_cache_params refcnt usage Vladimir Davydov
2014-05-30 13:51   ` Vladimir Davydov
2014-05-30 14:31   ` Christoph Lameter
2014-05-30 14:31     ` Christoph Lameter
2014-05-30 13:51 ` [PATCH -mm 2/8] memcg: destroy kmem caches when last slab is freed Vladimir Davydov
2014-05-30 13:51   ` Vladimir Davydov
2014-05-30 14:32   ` Christoph Lameter
2014-05-30 14:32     ` Christoph Lameter
2014-05-30 13:51 ` [PATCH -mm 3/8] memcg: mark caches that belong to offline memcgs as dead Vladimir Davydov
2014-05-30 13:51   ` Vladimir Davydov
2014-05-30 14:33   ` Christoph Lameter
2014-05-30 14:33     ` Christoph Lameter
2014-05-30 13:51 ` [PATCH -mm 4/8] slub: never fail kmem_cache_shrink Vladimir Davydov
2014-05-30 13:51   ` Vladimir Davydov
2014-05-30 14:46   ` Christoph Lameter
2014-05-30 14:46     ` Christoph Lameter
2014-05-31 10:18     ` Vladimir Davydov
2014-05-31 10:18       ` Vladimir Davydov
2014-06-02 15:13       ` Christoph Lameter
2014-06-02 15:13         ` Christoph Lameter
2014-05-30 13:51 ` [PATCH -mm 5/8] slab: remove kmem_cache_shrink retval Vladimir Davydov
2014-05-30 13:51   ` Vladimir Davydov
2014-05-30 14:49   ` Christoph Lameter
2014-05-30 14:49     ` Christoph Lameter
2014-05-31 10:27     ` Vladimir Davydov
2014-05-31 10:27       ` Vladimir Davydov
2014-06-02 15:16       ` Christoph Lameter
2014-06-02 15:16         ` Christoph Lameter
2014-06-03  9:06         ` Vladimir Davydov
2014-06-03  9:06           ` Vladimir Davydov
2014-06-03 14:48           ` Christoph Lameter
2014-06-03 14:48             ` Christoph Lameter
2014-06-03 19:00             ` Vladimir Davydov
2014-06-03 19:00               ` Vladimir Davydov
2014-05-30 13:51 ` [PATCH -mm 6/8] slub: do not use cmpxchg for adding cpu partials when irqs disabled Vladimir Davydov
2014-05-30 13:51   ` Vladimir Davydov
2014-05-30 13:51 ` [PATCH -mm 7/8] slub: make dead caches discard free slabs immediately Vladimir Davydov
2014-05-30 13:51   ` Vladimir Davydov
2014-05-30 14:57   ` Christoph Lameter
2014-05-30 14:57     ` Christoph Lameter
2014-05-31 11:04     ` Vladimir Davydov
2014-05-31 11:04       ` Vladimir Davydov
2014-06-02  4:24       ` Joonsoo Kim
2014-06-02  4:24         ` Joonsoo Kim
2014-06-02 11:47         ` Vladimir Davydov
2014-06-02 11:47           ` Vladimir Davydov
2014-06-02 14:03           ` Joonsoo Kim
2014-06-02 14:03             ` Joonsoo Kim
2014-06-02 15:17             ` Christoph Lameter
2014-06-02 15:17               ` Christoph Lameter
2014-06-03  8:16             ` Vladimir Davydov
2014-06-03  8:16               ` Vladimir Davydov
2014-06-04  8:53               ` Joonsoo Kim
2014-06-04  8:53                 ` Joonsoo Kim
2014-06-04  9:47                 ` Vladimir Davydov
2014-06-04  9:47                   ` Vladimir Davydov
2014-05-30 13:51 ` [PATCH -mm 8/8] slab: reap dead memcg caches aggressively Vladimir Davydov
2014-05-30 13:51   ` Vladimir Davydov
2014-05-30 15:01   ` Christoph Lameter
2014-05-30 15:01     ` Christoph Lameter
2014-05-31 11:19     ` Vladimir Davydov
2014-05-31 11:19       ` Vladimir Davydov
2014-06-02 15:24       ` Christoph Lameter
2014-06-02 15:24         ` Christoph Lameter
2014-06-03 20:18         ` Vladimir Davydov
2014-06-03 20:18           ` Vladimir Davydov
2014-06-02  4:41   ` Joonsoo Kim
2014-06-02  4:41     ` Joonsoo Kim
2014-06-02 12:10     ` Vladimir Davydov
2014-06-02 12:10       ` Vladimir Davydov
2014-06-02 14:01       ` Joonsoo Kim
2014-06-02 14:01         ` Joonsoo Kim
2014-06-03  8:21         ` Vladimir Davydov
2014-06-03  8:21           ` Vladimir Davydov
