* [PATCH -mm v3 0/8] memcg/slab: reintroduce dead cache self-destruction
From: Vladimir Davydov @ 2014-06-12 20:38 UTC
  To: akpm
  Cc: cl, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

Hi,

When a memcg is turned offline, some of its kmem caches can still have
active objects and therefore cannot be destroyed immediately. Currently,
we simply leak such caches along with the owner memcg, which is bad and
should be resolved.

It would be perfect if we could move all slab pages of such dead caches
to the root/parent cache on memcg offline. However, when I tried to
implement such re-parenting, Christoph pointed out that its overhead
would be unacceptable, at least for SLUB (see
https://lkml.org/lkml/2014/5/13/446).

The problem with re-parenting of individual slabs is that it requires
tracking of all slabs allocated to a cache, but SLUB doesn't track full
slabs if !debug. Changing this behavior would result in significant
performance degradation of regular alloc/free paths, because it would
make alloc/free take per node list locks more often.
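
To make that cost concrete, here is a purely hypothetical sketch (not
actual kernel code and not part of this series) of what the free path
would have to gain if SLUB tracked full slabs unconditionally:

  /* hypothetical: every transition of a slab to or from the "full"
   * state would have to take the shared per node list_lock, which
   * the current lock-free fast path never touches */
  static void track_full_slab(struct kmem_cache_node *n, struct page *page)
  {
          unsigned long flags;

          spin_lock_irqsave(&n->list_lock, flags);
          list_add(&page->lru, &n->full);
          spin_unlock_irqrestore(&n->list_lock, flags);
  }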

After pondering this problem for some time, I think we should return to
dead cache self-destruction, i.e. scheduling the cache destruction work
when the last slab page is freed.

This is the behavior we had before commit 5bd93da9917f ("memcg, slab:
simplify synchronization scheme"). It was removed because it simply
didn't work: SL[AU]B don't discard empty slabs immediately, but prefer
to keep them cached for an indefinite time to speed up further
allocations.

However, for dead caches we can change this behavior without noticeable
performance impact on either SLAB or SLUB by making them drop free
slabs as soon as the slabs become empty. Since dead caches should never
be allocated from, discarding their empty slabs immediately shouldn't
cause any noticeable performance degradation.

So, this patch set reintroduces dead cache self-destruction and adds
some tweaks to SL[AU]B to prevent dead caches from hanging around
indefinitely. It is organized as follows:

 - patches 1-3 reintroduce dead memcg cache self-destruction;
 - patch 4 makes SLUB's version of kmem_cache_shrink always drop empty
   slabs, even if it fails to allocate a temporary array;
 - patches 5 and 6 fix possible use-after-free connected with
   asynchronous cache destruction;
 - patches 7 and 8 disable caching of empty slabs for dead memcg caches
   for SLUB and SLAB respectively.
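
For orientation, here is a heavily simplified sketch of the
self-destruction scheme that patches 1-3 introduce. The names follow
the patches, but locking, error handling and the root-cache case are
omitted, so treat it as a sketch rather than working code:

  /* slab allocated: take a reference per slab */
  atomic_long_inc(&cachep->memcg_params->refcnt);

  /* slab freed: drop the reference; once the cache is dead and empty,
   * destruction runs from a worker */
  if (atomic_long_dec_and_test(&cachep->memcg_params->refcnt))
          schedule_work(&cachep->memcg_params->unregister_work);

  /* memcg offline: mark the cache dead, shrink it, and drop the
   * reference held for the memcg's lifetime */
  memcg_cache_mark_dead(cachep);
  kmem_cache_shrink(cachep);
  if (atomic_long_dec_and_test(&cachep->memcg_params->refcnt))
          memcg_unregister_cache(cachep);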

Note that this series doesn't resolve the problem of memcgs being
pinned by dead kmem caches. I'm planning to solve that by re-parenting
dead kmem caches to the parent memcg.

v3:

 - add smp barrier to memcg_cache_dead (Joonsoo);
 - add comment explaining why kfree has to be non-preemptable (Joonsoo);
 - do not call flush_all from put_cpu_partial (SLUB), because slab_free
   is now non-preemptable (Joonsoo);
 - simplify the patch disabling free slabs/objects caching for dead SLAB
   caches.

v2: https://lkml.org/lkml/2014/6/6/366

 - fix use-after-free connected with asynchronous cache destruction;
 - less intrusive version of SLUB's kmem_cache_shrink fix;
 - simplify disabling of free slabs caching for SLUB (Joonsoo);
 - disable free slabs caching instead of using cache_reap for SLAB
   (Christoph).

v1: https://lkml.org/lkml/2014/5/30/264

Thanks,

Vladimir Davydov (8):
  memcg: cleanup memcg_cache_params refcnt usage
  memcg: destroy kmem caches when last slab is freed
  memcg: mark caches that belong to offline memcgs as dead
  slub: don't fail kmem_cache_shrink if slab placement optimization
    fails
  slub: make slab_free non-preemptable
  memcg: wait for kfree's to finish before destroying cache
  slub: make dead memcg caches discard free slabs immediately
  slab: do not keep free objects/slabs on dead memcg caches

 include/linux/slab.h |   14 +++++++-----
 mm/memcontrol.c      |   59 ++++++++++++++++++++++++++++++++++++++++++++++----
 mm/slab.c            |   37 ++++++++++++++++++++++++++++++-
 mm/slab.h            |   25 +++++++++++++++++++++
 mm/slub.c            |   42 ++++++++++++++++++++++++++---------
 5 files changed, 156 insertions(+), 21 deletions(-)

-- 
1.7.10.4


* [PATCH -mm v3 1/8] memcg: cleanup memcg_cache_params refcnt usage
From: Vladimir Davydov @ 2014-06-12 20:38 UTC
  To: akpm
  Cc: cl, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

Currently, we count the number of pages allocated to a per memcg cache
in memcg_cache_params->nr_pages. We only use this counter to find out if
the cache is empty and can be destroyed. So let's rename it to refcnt
and make it count slabs rather than pages, so that we can use
atomic_inc/dec instead of atomic_add/sub in memcg_charge/uncharge_slab.

Also, as the number of slabs can theoretically exceed INT_MAX, let's use
atomic_long for the counter.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
 include/linux/slab.h |    4 ++--
 mm/memcontrol.c      |    6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1d9abb7d22a0..1985bd9bec7d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -526,7 +526,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * @memcg: pointer to the memcg this cache belongs to
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
- * @nr_pages: number of pages that belongs to this cache.
+ * @refcnt: reference counter
  */
 struct memcg_cache_params {
 	bool is_root_cache;
@@ -539,7 +539,7 @@ struct memcg_cache_params {
 			struct mem_cgroup *memcg;
 			struct list_head list;
 			struct kmem_cache *root_cache;
-			atomic_t nr_pages;
+			atomic_long_t refcnt;
 		};
 	};
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 15bda8133ff9..98a24e5ea4b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3279,7 +3279,7 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
 		cachep = memcg_params_to_cache(params);
 		kmem_cache_shrink(cachep);
-		if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
+		if (atomic_long_read(&cachep->memcg_params->refcnt) == 0)
 			memcg_unregister_cache(cachep);
 	}
 	mutex_unlock(&memcg_slab_mutex);
@@ -3353,14 +3353,14 @@ int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
 	res = memcg_charge_kmem(cachep->memcg_params->memcg, gfp,
 				PAGE_SIZE << order);
 	if (!res)
-		atomic_add(1 << order, &cachep->memcg_params->nr_pages);
+		atomic_long_inc(&cachep->memcg_params->refcnt);
 	return res;
 }
 
 void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
 {
 	memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
-	atomic_sub(1 << order, &cachep->memcg_params->nr_pages);
+	atomic_long_dec(&cachep->memcg_params->refcnt);
 }
 
 /*
-- 
1.7.10.4


* [PATCH -mm v3 2/8] memcg: destroy kmem caches when last slab is freed
From: Vladimir Davydov @ 2014-06-12 20:38 UTC
  To: akpm
  Cc: cl, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

When memcg_cache_params->refcnt drops to 0, schedule a worker that will
unregister the cache. To prevent this from happening while the owner
memcg is still alive, keep the refcnt incremented for the memcg's
lifetime.

Note that this doesn't guarantee that a cache belonging to a dead memcg
will go away as soon as its last object is freed, because the SL[AU]B
implementations can cache empty slabs for performance reasons. Hence the
cache may hang around indefinitely after memcg offline. This is resolved
by the following patches.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
 include/linux/slab.h |    2 ++
 mm/memcontrol.c      |   22 ++++++++++++++++++++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 1985bd9bec7d..d9716fdc8211 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -527,6 +527,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
  * @refcnt: reference counter
+ * @unregister_work: worker to destroy the cache
  */
 struct memcg_cache_params {
 	bool is_root_cache;
@@ -540,6 +541,7 @@ struct memcg_cache_params {
 			struct list_head list;
 			struct kmem_cache *root_cache;
 			atomic_long_t refcnt;
+			struct work_struct unregister_work;
 		};
 	};
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 98a24e5ea4b5..886b5b414958 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3114,6 +3114,8 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
 	return 0;
 }
 
+static void memcg_unregister_cache_func(struct work_struct *work);
+
 int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 			     struct kmem_cache *root_cache)
 {
@@ -3135,6 +3137,9 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 	if (memcg) {
 		s->memcg_params->memcg = memcg;
 		s->memcg_params->root_cache = root_cache;
+		atomic_long_set(&s->memcg_params->refcnt, 1);
+		INIT_WORK(&s->memcg_params->unregister_work,
+			  memcg_unregister_cache_func);
 		css_get(&memcg->css);
 	} else
 		s->memcg_params->is_root_cache = true;
@@ -3216,6 +3221,17 @@ static void memcg_unregister_cache(struct kmem_cache *cachep)
 	kmem_cache_destroy(cachep);
 }
 
+static void memcg_unregister_cache_func(struct work_struct *work)
+{
+	struct memcg_cache_params *params =
+		container_of(work, struct memcg_cache_params, unregister_work);
+	struct kmem_cache *cachep = memcg_params_to_cache(params);
+
+	mutex_lock(&memcg_slab_mutex);
+	memcg_unregister_cache(cachep);
+	mutex_unlock(&memcg_slab_mutex);
+}
+
 /*
  * During the creation a new cache, we need to disable our accounting mechanism
  * altogether. This is true even if we are not creating, but rather just
@@ -3279,7 +3295,7 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
 		cachep = memcg_params_to_cache(params);
 		kmem_cache_shrink(cachep);
-		if (atomic_long_read(&cachep->memcg_params->refcnt) == 0)
+		if (atomic_long_dec_and_test(&cachep->memcg_params->refcnt))
 			memcg_unregister_cache(cachep);
 	}
 	mutex_unlock(&memcg_slab_mutex);
@@ -3360,7 +3376,9 @@ int __memcg_charge_slab(struct kmem_cache *cachep, gfp_t gfp, int order)
 void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
 {
 	memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
-	atomic_long_dec(&cachep->memcg_params->refcnt);
+
+	if (unlikely(atomic_long_dec_and_test(&cachep->memcg_params->refcnt)))
+		schedule_work(&cachep->memcg_params->unregister_work);
 }
 
 /*
-- 
1.7.10.4


* [PATCH -mm v3 3/8] memcg: mark caches that belong to offline memcgs as dead
From: Vladimir Davydov @ 2014-06-12 20:38 UTC
  To: akpm
  Cc: cl, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

Add a `dead' flag to memcg_cache_params and set it when the owner memcg
is turned offline, so that the slab allocators can check whether a cache
is dead. This will be used by the following patches.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/slab.h |    2 ++
 mm/memcontrol.c      |    3 +++
 mm/slab.h            |   25 +++++++++++++++++++++++++
 3 files changed, 30 insertions(+)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index d9716fdc8211..d99d5212b815 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -527,6 +527,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
  * @refcnt: reference counter
+ * @dead: set to true when owner memcg is turned offline
  * @unregister_work: worker to destroy the cache
  */
 struct memcg_cache_params {
@@ -541,6 +542,7 @@ struct memcg_cache_params {
 			struct list_head list;
 			struct kmem_cache *root_cache;
 			atomic_long_t refcnt;
+			bool dead;
 			struct work_struct unregister_work;
 		};
 	};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 886b5b414958..8f08044d26a7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3294,7 +3294,10 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 	mutex_lock(&memcg_slab_mutex);
 	list_for_each_entry_safe(params, tmp, &memcg->memcg_slab_caches, list) {
 		cachep = memcg_params_to_cache(params);
+
+		memcg_cache_mark_dead(cachep);
 		kmem_cache_shrink(cachep);
+
 		if (atomic_long_dec_and_test(&cachep->memcg_params->refcnt))
 			memcg_unregister_cache(cachep);
 	}
diff --git a/mm/slab.h b/mm/slab.h
index 961a3fb1f5a2..2af5a0bc00a4 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -121,6 +121,26 @@ static inline bool is_root_cache(struct kmem_cache *s)
 	return !s->memcg_params || s->memcg_params->is_root_cache;
 }
 
+static inline bool memcg_cache_dead(struct kmem_cache *s)
+{
+	if (is_root_cache(s))
+		return false;
+
+	/*
+	 * Since this function can be called without holding any locks, it
+	 * needs a barrier here to guarantee the read won't be reordered.
+	 */
+	smp_rmb();
+	return s->memcg_params->dead;
+}
+
+static inline void memcg_cache_mark_dead(struct kmem_cache *s)
+{
+	BUG_ON(is_root_cache(s));
+	s->memcg_params->dead = true;
+	smp_wmb();		/* matches rmb in memcg_cache_dead() */
+}
+
 static inline bool slab_equal_or_root(struct kmem_cache *s,
 					struct kmem_cache *p)
 {
@@ -203,6 +223,11 @@ static inline bool is_root_cache(struct kmem_cache *s)
 	return true;
 }
 
+static inline bool memcg_cache_dead(struct kmem_cache *s)
+{
+	return false;
+}
+
 static inline bool slab_equal_or_root(struct kmem_cache *s,
 				      struct kmem_cache *p)
 {
-- 
1.7.10.4


* [PATCH -mm v3 4/8] slub: don't fail kmem_cache_shrink if slab placement optimization fails
From: Vladimir Davydov @ 2014-06-12 20:38 UTC
  To: akpm
  Cc: cl, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

SLUB's kmem_cache_shrink not only removes empty slabs from the cache,
but also sorts slabs by the number of objects in use to cope with
fragmentation. To achieve that, it tries to allocate a temporary array;
if the allocation fails, the whole procedure is aborted.

This is unacceptable for kmemcg, where we want to be sure that all empty
slabs are removed from the cache on memcg offline, so let's just skip
the slab placement optimization step if the allocation fails, but still
get rid of empty slabs.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
 mm/slub.c |   19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index d96faa2464c3..35741592be8c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3404,12 +3404,20 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	struct page *page;
 	struct page *t;
 	int objects = oo_objects(s->max);
+	struct list_head empty_slabs;
 	struct list_head *slabs_by_inuse =
 		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
 	unsigned long flags;
 
-	if (!slabs_by_inuse)
-		return -ENOMEM;
+	if (!slabs_by_inuse) {
+		/*
+		 * Do not fail shrinking empty slabs if allocation of the
+		 * temporary array failed. Just skip the slab placement
+		 * optimization then.
+		 */
+		slabs_by_inuse = &empty_slabs;
+		objects = 1;
+	}
 
 	flush_all(s);
 	for_each_node_state(node, N_NORMAL_MEMORY) {
@@ -3430,7 +3438,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		 * list_lock. page->inuse here is the upper limit.
 		 */
 		list_for_each_entry_safe(page, t, &n->partial, lru) {
-			list_move(&page->lru, slabs_by_inuse + page->inuse);
+			if (page->inuse < objects)
+				list_move(&page->lru,
+					  slabs_by_inuse + page->inuse);
 			if (!page->inuse)
 				n->nr_partial--;
 		}
@@ -3449,7 +3459,8 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 			discard_slab(s, page);
 	}
 
-	kfree(slabs_by_inuse);
+	if (slabs_by_inuse != &empty_slabs)
+		kfree(slabs_by_inuse);
 	return 0;
 }
 
-- 
1.7.10.4


* [PATCH -mm v3 5/8] slub: make slab_free non-preemptable
From: Vladimir Davydov @ 2014-06-12 20:38 UTC
  To: akpm
  Cc: cl, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

Since per memcg cache destruction is scheduled when the last slab is
freed, to avoid use-after-free in kmem_cache_free we should either
rearrange the code in kmem_cache_free so that it doesn't dereference the
cache pointer after freeing the object, or wait for all in-flight
kmem_cache_free's to complete before proceeding to cache destruction.

The former isn't a good option from the point of view of future
development, because every modification to kmem_cache_free would then
have to be made with great care. Hence we should provide a way to wait
for all currently executing kmem_cache_free's to finish.

This patch makes SLUB's implementation of kmem_cache_free
non-preemptable. As a result, synchronize_sched() will work as a barrier
against kmem_cache_free's in flight, so that issuing it before cache
destruction will protect us against the use-after-free.

This won't affect performance of kmem_cache_free, because we already
disable preemption there, and this patch only moves preempt_enable to
the end of the function. Neither should it affect the system latency,
because kmem_cache_free is extremely short, even in its slow path.
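
To illustrate how the barrier is meant to be used (the destruction side
is added by a later patch in this series), here is a simplified sketch
with all details omitted:

  /* SLUB free path after this patch: the whole of slab_free() runs
   * with preemption disabled, i.e. inside an rcu-sched read-side
   * critical section */
  preempt_disable();
  /* ... return the object to the per cpu freelist or the slab ... */
  preempt_enable();

  /* cache destruction path (see the next patch) */
  synchronize_sched();    /* waits for every kmem_cache_free in flight */
  memcg_unregister_cache(cachep);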

SLAB's version of kmem_cache_free already proceeds with irqs disabled,
so we only add a comment explaining why it's necessary for kmemcg there.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Christoph Lameter <cl@linux.com>
---
 mm/slab.c |    6 ++++++
 mm/slub.c |   12 ++++++------
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 9ca3b87edabc..b3af82419251 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3450,6 +3450,12 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
 {
 	struct array_cache *ac = cpu_cache_get(cachep);
 
+	/*
+	 * Since we free objects with irqs and therefore preemption disabled,
+	 * we can use synchronize_sched() to wait for all currently executing
+	 * kfree's to finish. This is necessary to avoid use-after-free on
+	 * per memcg cache destruction.
+	 */
 	check_irq_off();
 	kmemleak_free_recursive(objp, cachep->flags);
 	objp = cache_free_debugcheck(cachep, objp, caller);
diff --git a/mm/slub.c b/mm/slub.c
index 35741592be8c..52565a9426ef 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2673,18 +2673,17 @@ static __always_inline void slab_free(struct kmem_cache *s,
 
 	slab_free_hook(s, x);
 
-redo:
 	/*
-	 * Determine the currently cpus per cpu slab.
-	 * The cpu may change afterward. However that does not matter since
-	 * data is retrieved via this pointer. If we are on the same cpu
-	 * during the cmpxchg then the free will succedd.
+	 * We could make this function fully preemptable, but then we wouldn't
+	 * have a method to wait for all currently executing kfree's to finish,
+	 * which is necessary to avoid use-after-free on per memcg cache
+	 * destruction.
 	 */
 	preempt_disable();
+redo:
 	c = this_cpu_ptr(s->cpu_slab);
 
 	tid = c->tid;
-	preempt_enable();
 
 	if (likely(page == c->page)) {
 		set_freepointer(s, object, c->freelist);
@@ -2701,6 +2700,7 @@ redo:
 	} else
 		__slab_free(s, page, x, addr);
 
+	preempt_enable();
 }
 
 void kmem_cache_free(struct kmem_cache *s, void *x)
-- 
1.7.10.4


* [PATCH -mm v3 6/8] memcg: wait for kfree's to finish before destroying cache
From: Vladimir Davydov @ 2014-06-12 20:38 UTC
  To: akpm
  Cc: cl, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

kmem_cache_free doesn't expect the cache to be destroyed as soon as the
object is freed, e.g. SLUB's implementation may want to update cache
stats after putting the object on the free list.

Therefore we should wait for all kmem_cache_free's to finish before
proceeding to cache destruction. Since both the SLAB and SLUB versions
of kmem_cache_free are now non-preemptable, waiting for an rcu-sched
grace period to elapse is sufficient.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/slab.h |    6 ++----
 mm/memcontrol.c      |   34 ++++++++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 6 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index d99d5212b815..68b1feaba9d6 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -532,11 +532,9 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  */
 struct memcg_cache_params {
 	bool is_root_cache;
+	struct rcu_head rcu_head;
 	union {
-		struct {
-			struct rcu_head rcu_head;
-			struct kmem_cache *memcg_caches[0];
-		};
+		struct kmem_cache *memcg_caches[0];
 		struct {
 			struct mem_cgroup *memcg;
 			struct list_head list;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8f08044d26a7..516964a11f5a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3232,6 +3232,14 @@ static void memcg_unregister_cache_func(struct work_struct *work)
 	mutex_unlock(&memcg_slab_mutex);
 }
 
+static void memcg_unregister_cache_rcu_func(struct rcu_head *rcu)
+{
+	struct memcg_cache_params *params =
+		container_of(rcu, struct memcg_cache_params, rcu_head);
+
+	schedule_work(&params->unregister_work);
+}
+
 /*
  * During the creation a new cache, we need to disable our accounting mechanism
  * altogether. This is true even if we are not creating, but rather just
@@ -3287,6 +3295,7 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 {
 	struct kmem_cache *cachep;
 	struct memcg_cache_params *params, *tmp;
+	LIST_HEAD(empty_caches);
 
 	if (!memcg_kmem_is_active(memcg))
 		return;
@@ -3299,7 +3308,26 @@ static void memcg_unregister_all_caches(struct mem_cgroup *memcg)
 		kmem_cache_shrink(cachep);
 
 		if (atomic_long_dec_and_test(&cachep->memcg_params->refcnt))
-			memcg_unregister_cache(cachep);
+			list_move(&cachep->memcg_params->list, &empty_caches);
+	}
+
+	/*
+	 * kmem_cache_free doesn't expect that the cache can be destroyed as
+	 * soon as the object is freed, e.g. SLUB's implementation may want to
+	 * update cache stats after putting the object to the free list.
+	 *
+	 * Therefore we should wait for all kmem_cache_free's to finish before
+	 * proceeding to cache destruction. Since both SLAB and SLUB versions
+	 * of kmem_cache_free are non-preemptable, we wait for rcu-sched grace
+	 * period to elapse.
+	 */
+	synchronize_sched();
+
+	while (!list_empty(&empty_caches)) {
+		params = list_first_entry(&empty_caches,
+					  struct memcg_cache_params, list);
+		cachep = memcg_params_to_cache(params);
+		memcg_unregister_cache(cachep);
 	}
 	mutex_unlock(&memcg_slab_mutex);
 }
@@ -3381,7 +3409,9 @@ void __memcg_uncharge_slab(struct kmem_cache *cachep, int order)
 	memcg_uncharge_kmem(cachep->memcg_params->memcg, PAGE_SIZE << order);
 
 	if (unlikely(atomic_long_dec_and_test(&cachep->memcg_params->refcnt)))
-		schedule_work(&cachep->memcg_params->unregister_work);
+		/* see memcg_unregister_all_caches */
+		call_rcu_sched(&cachep->memcg_params->rcu_head,
+			       memcg_unregister_cache_rcu_func);
 }
 
 /*
-- 
1.7.10.4


* [PATCH -mm v3 7/8] slub: make dead memcg caches discard free slabs immediately
From: Vladimir Davydov @ 2014-06-12 20:38 UTC
  To: akpm
  Cc: cl, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

Since a dead memcg cache is destroyed only after the last slab
allocated to it is freed, we must disable caching of empty slabs for
such caches; otherwise they may hang around forever.

This patch makes SLUB discard dead memcg caches' slabs as soon as they
become empty. To achieve that, it disables per cpu partial lists for
dead caches (see put_cpu_partial) and forbids keeping empty slabs on per
node partial lists by setting the cache's min_partial to 0 in
kmem_cache_shrink, which is always called on memcg offline (see
memcg_unregister_all_caches).

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Thanks-to: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slub.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index 52565a9426ef..0d2d1978e62c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2064,6 +2064,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 
 	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
 								!= oldpage);
+
+	if (memcg_cache_dead(s)) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+		local_irq_restore(flags);
+	}
 #endif
 }
 
@@ -3409,6 +3417,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
 	unsigned long flags;
 
+	if (memcg_cache_dead(s))
+		s->min_partial = 0;
+
 	if (!slabs_by_inuse) {
 		/*
 		 * Do not fail shrinking empty slabs if allocation of the
-- 
1.7.10.4


* [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches
From: Vladimir Davydov @ 2014-06-12 20:38 UTC
  To: akpm
  Cc: cl, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

Since a dead memcg cache is destroyed only after the last slab
allocated to it is freed, we must disable caching of free objects/slabs
for such caches; otherwise they may hang around forever.

For SLAB that means we must disable per cpu free object arrays and make
free_block always discard empty slabs irrespective of node's free_limit.

To disable per cpu arrays, we free them on kmem_cache_shrink (see
drain_cpu_caches -> do_drain) and make __cache_free fall back to
free_block if there is no per cpu array. Also, we have to disable
allocation of per cpu arrays on cpu hotplug for dead caches (see
cpuup_prepare, __do_tune_cpucache).

After free object/slab caching is disabled, there is no need to reap
such caches periodically; doing so would only result in slowdown. So we
also make cache_reap skip them.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slab.c |   31 ++++++++++++++++++++++++++++++-
 1 file changed, 30 insertions(+), 1 deletion(-)

diff --git a/mm/slab.c b/mm/slab.c
index b3af82419251..7e91f5f1341d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1210,6 +1210,9 @@ static int cpuup_prepare(long cpu)
 		struct array_cache *shared = NULL;
 		struct array_cache **alien = NULL;
 
+		if (memcg_cache_dead(cachep))
+			continue;
+
 		nc = alloc_arraycache(node, cachep->limit,
 					cachep->batchcount, GFP_KERNEL);
 		if (!nc)
@@ -2411,10 +2414,18 @@ static void do_drain(void *arg)
 
 	check_irq_off();
 	ac = cpu_cache_get(cachep);
+	if (!ac)
+		return;
+
 	spin_lock(&cachep->node[node]->list_lock);
 	free_block(cachep, ac->entry, ac->avail, node);
 	spin_unlock(&cachep->node[node]->list_lock);
 	ac->avail = 0;
+
+	if (memcg_cache_dead(cachep)) {
+		cachep->array[smp_processor_id()] = NULL;
+		kfree(ac);
+	}
 }
 
 static void drain_cpu_caches(struct kmem_cache *cachep)
@@ -3368,7 +3379,8 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
 
 		/* fixup slab chains */
 		if (page->active == 0) {
-			if (n->free_objects > n->free_limit) {
+			if (n->free_objects > n->free_limit ||
+			    memcg_cache_dead(cachep)) {
 				n->free_objects -= cachep->num;
 				/* No need to drop any previously held
 				 * lock here, even if we have a off-slab slab
@@ -3462,6 +3474,17 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
 
 	kmemcheck_slab_free(cachep, objp, cachep->object_size);
 
+#ifdef CONFIG_MEMCG_KMEM
+	if (unlikely(!ac)) {
+		int nodeid = page_to_nid(virt_to_page(objp));
+
+		spin_lock(&cachep->node[nodeid]->list_lock);
+		free_block(cachep, &objp, 1, nodeid);
+		spin_unlock(&cachep->node[nodeid]->list_lock);
+		return;
+	}
+#endif
+
 	/*
 	 * Skip calling cache_free_alien() when the platform is not numa.
 	 * This will avoid cache misses that happen while accessing slabp (which
@@ -3803,6 +3826,9 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
 	struct ccupdate_struct *new;
 	int i;
 
+	if (memcg_cache_dead(cachep))
+		return 0;
+
 	new = kzalloc(sizeof(*new) + nr_cpu_ids * sizeof(struct array_cache *),
 		      gfp);
 	if (!new)
@@ -3988,6 +4014,9 @@ static void cache_reap(struct work_struct *w)
 	list_for_each_entry(searchp, &slab_caches, list) {
 		check_irq_on();
 
+		if (memcg_cache_dead(searchp))
+			continue;
+
 		/*
 		 * We only take the node lock if absolutely necessary and we
 		 * have established with reasonable certainty that
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches
  2014-06-12 20:38   ` Vladimir Davydov
@ 2014-06-12 20:41     ` Vladimir Davydov
  -1 siblings, 0 replies; 50+ messages in thread
From: Vladimir Davydov @ 2014-06-12 20:41 UTC (permalink / raw)
  To: akpm
  Cc: cl, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

On Fri, Jun 13, 2014 at 12:38:22AM +0400, Vladimir Davydov wrote:
> Since a dead memcg cache is destroyed only after the last slab allocated
> to it is freed, we must disable caching of free objects/slabs for such
> caches, otherwise they will be hanging around forever.
> 
> For SLAB that means we must disable per cpu free object arrays and make
> free_block always discard empty slabs irrespective of node's free_limit.

An alternative to this could be making cache_reap, which drains per cpu
arrays and drops free slabs periodically for all caches, shrink dead
caches aggressively. The patch doing this is attached.

This approach has its pros and cons compared to disabling per cpu
arrays.

Pros:
 - Less intrusive: it only requires modification of cache_reap.
 - Doesn't impact performance: free path isn't touched.

Cons:
 - Delays dead cache destruction: the lag between the last object being
   freed and the cache being destroyed isn't constant. It depends on the number
   of kmem-active memcgs and the number of dead caches (the more of
   them, the longer it'll take to shrink dead caches). Also, on NUMA
   machines the upper bound will be proportional to the number of NUMA
   nodes, because alien caches are reaped one at a time (see
   reap_alien).
 - If there are a lot of dead caches, periodic shrinking will be slowed
   down even for active caches (see cache_reap).

--

diff --git a/mm/slab.c b/mm/slab.c
index 9ca3b87edabc..811fdb214b9e 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3980,6 +3980,11 @@ static void cache_reap(struct work_struct *w)
 		goto out;
 
 	list_for_each_entry(searchp, &slab_caches, list) {
+		int force = 0;
+
+		if (memcg_cache_dead(searchp))
+			force = 1;
+
 		check_irq_on();
 
 		/*
@@ -3991,7 +3996,7 @@ static void cache_reap(struct work_struct *w)
 
 		reap_alien(searchp, n);
 
-		drain_array(searchp, n, cpu_cache_get(searchp), 0, node);
+		drain_array(searchp, n, cpu_cache_get(searchp), force, node);
 
 		/*
 		 * These are racy checks but it does not matter
@@ -4002,15 +4007,17 @@ static void cache_reap(struct work_struct *w)
 
 		n->next_reap = jiffies + REAPTIMEOUT_NODE;
 
-		drain_array(searchp, n, n->shared, 0, node);
+		drain_array(searchp, n, n->shared, force, node);
 
 		if (n->free_touched)
 			n->free_touched = 0;
 		else {
-			int freed;
+			int freed, tofree;
+
+			tofree = force ? slabs_tofree(searchp, n) :
+				DIV_ROUND_UP(n->free_limit, 5 * searchp->num);
 
-			freed = drain_freelist(searchp, n, (n->free_limit +
-				5 * searchp->num - 1) / (5 * searchp->num));
+			freed = drain_freelist(searchp, n, tofree);
 			STATS_ADD_REAPED(searchp, freed);
 		}
 next:
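
For reference, slabs_tofree() used above is an existing helper in mm/slab.c
that returns the number of slabs needed to hold all free objects on a node,
i.e. an upper bound on how many slabs drain_freelist can drop. From memory
it looks roughly like this (quoted here for convenience, not part of the
patch):

static int slabs_tofree(struct kmem_cache *cachep,
			struct kmem_cache_node *n)
{
	/* how many whole slabs the node's free objects could occupy */
	return (n->free_objects + cachep->num - 1) / cachep->num;
}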

^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 7/8] slub: make dead memcg caches discard free slabs immediately
  2014-06-12 20:38   ` Vladimir Davydov
@ 2014-06-13 16:54     ` Christoph Lameter
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-06-13 16:54 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: akpm, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

On Fri, 13 Jun 2014, Vladimir Davydov wrote:

> Since a dead memcg cache is destroyed only after the last slab allocated
> to it is freed, we must disable caching of empty slabs for such caches,
> otherwise they will be hanging around forever.

Looks good and clean.

Acked-by: Christoph Lameter <cl@linux.com>

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches
  2014-06-12 20:38   ` Vladimir Davydov
@ 2014-06-24  7:25     ` Joonsoo Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Joonsoo Kim @ 2014-06-24  7:25 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: akpm, cl, rientjes, penberg, hannes, mhocko, linux-kernel, linux-mm

On Fri, Jun 13, 2014 at 12:38:22AM +0400, Vladimir Davydov wrote:
> Since a dead memcg cache is destroyed only after the last slab allocated
> to it is freed, we must disable caching of free objects/slabs for such
> caches, otherwise they will be hanging around forever.
> 
> For SLAB that means we must disable per cpu free object arrays and make
> free_block always discard empty slabs irrespective of node's free_limit.
> 
> To disable per cpu arrays, we free them on kmem_cache_shrink (see
> drain_cpu_caches -> do_drain) and make __cache_free fall back to
> free_block if there is no per cpu array. Also, we have to disable
> allocation of per cpu arrays on cpu hotplug for dead caches (see
> cpuup_prepare, __do_tune_cpucache).
> 
> After we disabled free objects/slabs caching, there is no need to reap
> those caches periodically. Moreover, it will only result in slowdown. So
> we also make cache_reap skip then.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
>  mm/slab.c |   31 ++++++++++++++++++++++++++++++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index b3af82419251..7e91f5f1341d 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1210,6 +1210,9 @@ static int cpuup_prepare(long cpu)
>  		struct array_cache *shared = NULL;
>  		struct array_cache **alien = NULL;
>  
> +		if (memcg_cache_dead(cachep))
> +			continue;
> +
>  		nc = alloc_arraycache(node, cachep->limit,
>  					cachep->batchcount, GFP_KERNEL);
>  		if (!nc)
> @@ -2411,10 +2414,18 @@ static void do_drain(void *arg)
>  
>  	check_irq_off();
>  	ac = cpu_cache_get(cachep);
> +	if (!ac)
> +		return;
> +
>  	spin_lock(&cachep->node[node]->list_lock);
>  	free_block(cachep, ac->entry, ac->avail, node);
>  	spin_unlock(&cachep->node[node]->list_lock);
>  	ac->avail = 0;
> +
> +	if (memcg_cache_dead(cachep)) {
> +		cachep->array[smp_processor_id()] = NULL;
> +		kfree(ac);
> +	}
>  }
>  
>  static void drain_cpu_caches(struct kmem_cache *cachep)
> @@ -3368,7 +3379,8 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
>  
>  		/* fixup slab chains */
>  		if (page->active == 0) {
> -			if (n->free_objects > n->free_limit) {
> +			if (n->free_objects > n->free_limit ||
> +			    memcg_cache_dead(cachep)) {

Hello, Vladimir.

I'd like to set free_limit to 0 in __kmem_cache_shrink()
rather than testing memcg_cache_dead() here, because memcg_cache_dead()
is more expensive than that check. Is there any problem with this approach?

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches
  2014-06-12 20:38   ` Vladimir Davydov
@ 2014-06-24  7:38     ` Joonsoo Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Joonsoo Kim @ 2014-06-24  7:38 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: akpm, cl, rientjes, penberg, hannes, mhocko, linux-kernel, linux-mm

On Fri, Jun 13, 2014 at 12:38:22AM +0400, Vladimir Davydov wrote:
> Since a dead memcg cache is destroyed only after the last slab allocated
> to it is freed, we must disable caching of free objects/slabs for such
> caches, otherwise they will be hanging around forever.
> 
> For SLAB that means we must disable per cpu free object arrays and make
> free_block always discard empty slabs irrespective of node's free_limit.
> 
> To disable per cpu arrays, we free them on kmem_cache_shrink (see
> drain_cpu_caches -> do_drain) and make __cache_free fall back to
> free_block if there is no per cpu array. Also, we have to disable
> allocation of per cpu arrays on cpu hotplug for dead caches (see
> cpuup_prepare, __do_tune_cpucache).
> 
> After we disabled free objects/slabs caching, there is no need to reap
> those caches periodically. Moreover, it will only result in slowdown. So
> we also make cache_reap skip then.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> ---
>  mm/slab.c |   31 ++++++++++++++++++++++++++++++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/slab.c b/mm/slab.c
> index b3af82419251..7e91f5f1341d 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -1210,6 +1210,9 @@ static int cpuup_prepare(long cpu)
>  		struct array_cache *shared = NULL;
>  		struct array_cache **alien = NULL;
>  
> +		if (memcg_cache_dead(cachep))
> +			continue;
> +
>  		nc = alloc_arraycache(node, cachep->limit,
>  					cachep->batchcount, GFP_KERNEL);
>  		if (!nc)
> @@ -2411,10 +2414,18 @@ static void do_drain(void *arg)
>  
>  	check_irq_off();
>  	ac = cpu_cache_get(cachep);
> +	if (!ac)
> +		return;
> +
>  	spin_lock(&cachep->node[node]->list_lock);
>  	free_block(cachep, ac->entry, ac->avail, node);
>  	spin_unlock(&cachep->node[node]->list_lock);
>  	ac->avail = 0;
> +
> +	if (memcg_cache_dead(cachep)) {
> +		cachep->array[smp_processor_id()] = NULL;
> +		kfree(ac);
> +	}
>  }
>  
>  static void drain_cpu_caches(struct kmem_cache *cachep)
> @@ -3368,7 +3379,8 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
>  
>  		/* fixup slab chains */
>  		if (page->active == 0) {
> -			if (n->free_objects > n->free_limit) {
> +			if (n->free_objects > n->free_limit ||
> +			    memcg_cache_dead(cachep)) {
>  				n->free_objects -= cachep->num;
>  				/* No need to drop any previously held
>  				 * lock here, even if we have a off-slab slab
> @@ -3462,6 +3474,17 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
>  
>  	kmemcheck_slab_free(cachep, objp, cachep->object_size);
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +	if (unlikely(!ac)) {
> +		int nodeid = page_to_nid(virt_to_page(objp));
> +
> +		spin_lock(&cachep->node[nodeid]->list_lock);
> +		free_block(cachep, &objp, 1, nodeid);
> +		spin_unlock(&cachep->node[nodeid]->list_lock);
> +		return;
> +	}
> +#endif
> +

And, please document the intention of this code. :)

And, you said that this implementation could be slow because there
could be many objects in dead caches and it needs to take the node
spin_lock on every object free. Is that no longer a problem?

If you have any performance data comparing this implementation with
the alternative one, could you share it?

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches
  2014-06-24  7:25     ` Joonsoo Kim
@ 2014-06-24  7:42       ` Vladimir Davydov
  -1 siblings, 0 replies; 50+ messages in thread
From: Vladimir Davydov @ 2014-06-24  7:42 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: akpm, cl, rientjes, penberg, hannes, mhocko, linux-kernel, linux-mm

Hi,

On Tue, Jun 24, 2014 at 04:25:54PM +0900, Joonsoo Kim wrote:
> On Fri, Jun 13, 2014 at 12:38:22AM +0400, Vladimir Davydov wrote:
> > @@ -3368,7 +3379,8 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
> >  
> >  		/* fixup slab chains */
> >  		if (page->active == 0) {
> > -			if (n->free_objects > n->free_limit) {
> > +			if (n->free_objects > n->free_limit ||
> > +			    memcg_cache_dead(cachep)) {
> 
> I'd like to set 0 to free_limit in __kmem_cache_shrink()
> rather than memcg_cache_dead() test here, because memcg_cache_dead()
> is more expensive than it. Is there any problem in this way?

We'd have to be careful on cpu hotplug then, because it may update the
free_limit. Not a big problem though. Will fix.
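
For reference, the hotplug paths in question adjust the limit roughly like
this (quoting the relevant lines from mm/slab.c; see also the follow-up
patch later in the thread):

	/* init_cache_node_node(): (re)compute the limit for the node */
	n->free_limit = (1 + nr_cpus_node(node)) *
			cachep->batchcount + cachep->num;

	/* cpuup_canceled(): shrink the limit when a cpu goes away */
	n->free_limit -= cachep->batchcount;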

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches
  2014-06-24  7:38     ` Joonsoo Kim
@ 2014-06-24  7:48       ` Vladimir Davydov
  -1 siblings, 0 replies; 50+ messages in thread
From: Vladimir Davydov @ 2014-06-24  7:48 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: akpm, cl, rientjes, penberg, hannes, mhocko, linux-kernel, linux-mm

On Tue, Jun 24, 2014 at 04:38:41PM +0900, Joonsoo Kim wrote:
> On Fri, Jun 13, 2014 at 12:38:22AM +0400, Vladimir Davydov wrote:
> > @@ -3462,6 +3474,17 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
> >  
> >  	kmemcheck_slab_free(cachep, objp, cachep->object_size);
> >  
> > +#ifdef CONFIG_MEMCG_KMEM
> > +	if (unlikely(!ac)) {
> > +		int nodeid = page_to_nid(virt_to_page(objp));
> > +
> > +		spin_lock(&cachep->node[nodeid]->list_lock);
> > +		free_block(cachep, &objp, 1, nodeid);
> > +		spin_unlock(&cachep->node[nodeid]->list_lock);
> > +		return;
> > +	}
> > +#endif
> > +
> 
> And, please document intention of this code. :)

Sure.

> And, you said that this way of implementation would be slow because
> there could be many object in dead caches and this implementation
> needs node spin_lock on each object freeing. Is it no problem now?

It may be :(

> If you have any performance data about this implementation and
> alternative one, could you share it?

I haven't (shame on me!). I'll do some testing today and send you the
results.

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 7/8] slub: make dead memcg caches discard free slabs immediately
  2014-06-12 20:38   ` Vladimir Davydov
@ 2014-06-24  7:50     ` Joonsoo Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Joonsoo Kim @ 2014-06-24  7:50 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: akpm, cl, rientjes, penberg, hannes, mhocko, linux-kernel, linux-mm

On Fri, Jun 13, 2014 at 12:38:21AM +0400, Vladimir Davydov wrote:
> Since a dead memcg cache is destroyed only after the last slab allocated
> to it is freed, we must disable caching of empty slabs for such caches,
> otherwise they will be hanging around forever.
> 
> This patch makes SLUB discard dead memcg caches' slabs as soon as they
> become empty. To achieve that, it disables per cpu partial lists for
> dead caches (see put_cpu_partial) and forbids keeping empty slabs on per
> node partial lists by setting cache's min_partial to 0 on
> kmem_cache_shrink, which is always called on memcg offline (see
> memcg_unregister_all_caches).
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> Thanks-to: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  mm/slub.c |   11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index 52565a9426ef..0d2d1978e62c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2064,6 +2064,14 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>  
>  	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
>  								!= oldpage);
> +
> +	if (memcg_cache_dead(s)) {
> +		unsigned long flags;
> +
> +		local_irq_save(flags);
> +		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
> +		local_irq_restore(flags);
> +	}
>  #endif
>  }
>  
> @@ -3409,6 +3417,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
>  		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
>  	unsigned long flags;
>  
> +	if (memcg_cache_dead(s))
> +		s->min_partial = 0;
> +
>  	if (!slabs_by_inuse) {
>  		/*
>  		 * Do not fail shrinking empty slabs if allocation of the

I think that you should move the n->nr_partial test down below the point
where the lock is taken in __kmem_cache_shrink(). Accessing n->nr_partial
without the node lock is racy and can return a stale value. That can cause
empty slabs to be skipped rather than freed, so your destruction logic
could fail.
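
To make the window concrete, here is roughly how that part of
__kmem_cache_shrink() looks (just a sketch from the current code, details
may differ with your series applied):

	for_each_kmem_cache_node(s, node, n) {
		/*
		 * Racy: a concurrent free can add an empty slab to
		 * n->partial right after this check has seen
		 * nr_partial == 0, and then the whole node is skipped.
		 */
		if (!n->nr_partial)
			continue;

		for (i = 0; i < objects; i++)
			INIT_LIST_HEAD(slabs_by_inuse + i);

		spin_lock_irqsave(&n->list_lock, flags);

		/* sort partial slabs by the number of objects in use */
		list_for_each_entry_safe(page, t, &n->partial, lru) {
			list_move(&page->lru, slabs_by_inuse + page->inuse);
			if (!page->inuse)
				n->nr_partial--;
		}

		/* put the busier slabs back, most filled up first */
		for (i = objects - 1; i > 0; i--)
			list_splice(slabs_by_inuse + i, n->partial.prev);

		spin_unlock_irqrestore(&n->list_lock, flags);

		/* release the slabs that ended up empty */
		list_for_each_entry_safe(page, t, slabs_by_inuse, lru)
			discard_slab(s, page);
	}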

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 7/8] slub: make dead memcg caches discard free slabs immediately
  2014-06-24  7:50     ` Joonsoo Kim
@ 2014-06-24  8:25       ` Vladimir Davydov
  -1 siblings, 0 replies; 50+ messages in thread
From: Vladimir Davydov @ 2014-06-24  8:25 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: akpm, cl, rientjes, penberg, hannes, mhocko, linux-kernel, linux-mm

On Tue, Jun 24, 2014 at 04:50:11PM +0900, Joonsoo Kim wrote:
> On Fri, Jun 13, 2014 at 12:38:21AM +0400, Vladimir Davydov wrote:
> > @@ -3409,6 +3417,9 @@ int __kmem_cache_shrink(struct kmem_cache *s)
> >  		kmalloc(sizeof(struct list_head) * objects, GFP_KERNEL);
> >  	unsigned long flags;
> >  
> > +	if (memcg_cache_dead(s))
> > +		s->min_partial = 0;
> > +
> >  	if (!slabs_by_inuse) {
> >  		/*
> >  		 * Do not fail shrinking empty slabs if allocation of the
> 
> I think that you should move down n->nr_partial test after holding the
> lock in __kmem_cache_shrink(). Access to n->nr_partial without node lock
> is racy and you can see wrong value. It results in skipping to free empty
> slab so your destroying logic could fail.

You're right! Will fix this.

And there seems to be the same problem in SLAB, where we check the
node->slabs_free list for emptiness w/o holding node->list_lock (see
drain_freelist), while the list can be modified concurrently by free_block.
This will be fixed automatically once we make __kmem_cache_shrink zero
node->free_limit (which must be done under the lock) though.
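
The check I mean is the loop condition in drain_freelist, which looks
roughly like this (a condensed sketch, not the exact code):

	while (nr_freed < tofree && !list_empty(&n->slabs_free)) {
		/*
		 * If free_block moves a slab to slabs_free right after the
		 * unlocked list_empty check above has seen the list empty,
		 * the loop has already terminated and that slab is left
		 * behind.
		 */
		spin_lock_irq(&n->list_lock);
		/* ... re-check the list and free one slab under the lock ... */
		spin_unlock_irq(&n->list_lock);
		nr_freed++;
	}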

Thank you!

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH -mm] slub: kmem_cache_shrink: check if partial list is empty under list_lock
  2014-06-24  7:50     ` Joonsoo Kim
@ 2014-06-24  9:42       ` Vladimir Davydov
  -1 siblings, 0 replies; 50+ messages in thread
From: Vladimir Davydov @ 2014-06-24  9:42 UTC (permalink / raw)
  To: akpm
  Cc: iamjoonsoo.kim, cl, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

SLUB's implementation of kmem_cache_shrink skips nodes that have
nr_partial=0, because they surely don't have any empty slabs to free.
This check is done w/o holding any locks, therefore it can race with
concurrent kfree adding an empty slab to a partial list. As a result, a
cache that has just been shrunk can still have empty slabs.

This is unacceptable for kmemcg, which needs to be sure that there will
be no empty slabs on dead memcg caches after kmem_cache_shrink was
called, because otherwise we may leak a dead cache.

Let's fix this race by checking if node partial list is empty under
node->list_lock. Since the nr_partial!=0 branch of kmem_cache_shrink
does nothing if the list is empty, we can simply remove the nr_partial=0
check.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slub.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 67da14d9ec70..891ac6cd78cc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3397,9 +3397,6 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
-		if (!n->nr_partial)
-			continue;
-
 		for (i = 0; i < objects; i++)
 			INIT_LIST_HEAD(slabs_by_inuse + i);
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH -mm] slab: set free_limit for dead caches to 0
  2014-06-24  7:25     ` Joonsoo Kim
@ 2014-06-24 12:28       ` Vladimir Davydov
  -1 siblings, 0 replies; 50+ messages in thread
From: Vladimir Davydov @ 2014-06-24 12:28 UTC (permalink / raw)
  To: akpm
  Cc: iamjoonsoo.kim, cl, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

We mustn't keep empty slabs on dead memcg caches, because otherwise they
will never be destroyed.

Currently, we check if the cache is dead in free_block and, if so, drop
empty slabs irrespective of the node's free_limit. This can be pretty
expensive. So let's avoid this additional check by zeroing the nodes'
free_limit for dead caches on kmem_cache_shrink. This way no additional
overhead is added to the free hot path.

Note, since ->free_limit can be updated on cpu/memory hotplug, we must
handle it properly there.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slab.c |   24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b35bf2120b96..6009e44a4d1d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1155,11 +1155,13 @@ static int init_cache_node_node(int node)
 			cachep->node[node] = n;
 		}
 
-		spin_lock_irq(&n->list_lock);
-		n->free_limit =
-			(1 + nr_cpus_node(node)) *
-			cachep->batchcount + cachep->num;
-		spin_unlock_irq(&n->list_lock);
+		if (!memcg_cache_dead(cachep)) {
+			spin_lock_irq(&n->list_lock);
+			n->free_limit =
+				(1 + nr_cpus_node(node)) *
+				cachep->batchcount + cachep->num;
+			spin_unlock_irq(&n->list_lock);
+		}
 	}
 	return 0;
 }
@@ -1193,7 +1195,8 @@ static void cpuup_canceled(long cpu)
 		spin_lock_irq(&n->list_lock);
 
 		/* Free limit for this kmem_cache_node */
-		n->free_limit -= cachep->batchcount;
+		if (!memcg_cache_dead(cachep))
+			n->free_limit -= cachep->batchcount;
 		if (nc)
 			free_block(cachep, nc->entry, nc->avail, node);
 
@@ -2544,6 +2547,12 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 
 	check_irq_on();
 	for_each_kmem_cache_node(cachep, node, n) {
+		if (memcg_cache_dead(cachep)) {
+			spin_lock_irq(&n->list_lock);
+			n->free_limit = 0;
+			spin_unlock_irq(&n->list_lock);
+		}
+
 		drain_freelist(cachep, n, slabs_tofree(cachep, n));
 
 		ret += !list_empty(&n->slabs_full) ||
@@ -3426,8 +3435,7 @@ static void free_block(struct kmem_cache *cachep, void **objpp, int nr_objects,
 
 		/* fixup slab chains */
 		if (page->active == 0) {
-			if (n->free_objects > n->free_limit ||
-			    memcg_cache_dead(cachep)) {
+			if (n->free_objects > n->free_limit) {
 				n->free_objects -= cachep->num;
 				/* No need to drop any previously held
 				 * lock here, even if we have a off-slab slab
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches
  2014-06-24  7:38     ` Joonsoo Kim
@ 2014-06-25 13:45       ` Vladimir Davydov
  -1 siblings, 0 replies; 50+ messages in thread
From: Vladimir Davydov @ 2014-06-25 13:45 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: akpm, cl, rientjes, penberg, hannes, mhocko, linux-kernel, linux-mm

On Tue, Jun 24, 2014 at 04:38:41PM +0900, Joonsoo Kim wrote:
> On Fri, Jun 13, 2014 at 12:38:22AM +0400, Vladimir Davydov wrote:
> And, you said that this way of implementation would be slow because
> there could be many object in dead caches and this implementation
> needs node spin_lock on each object freeing. Is it no problem now?
> 
> If you have any performance data about this implementation and
> alternative one, could you share it?

I ran some tests on a 2 CPU x 6 core x 2 HT box. The kernel was compiled
with a config taken from a popular distro, so it had most debug
options turned off.

---

TEST #1: Each logical CPU executes a task that frees 1M objects
         allocated from the same cache. All frees are node-local.

RESULTS:

objsize (bytes) | free time, live cache (ms) | free time, dead cache (ms)
----------------+----------------------------+---------------------------
          64    |          373 +-  5         |         1300 +-  6
         128    |          387 +-  6         |         1337 +-  6
         256    |          484 +-  4         |         1407 +-  6
         512    |          686 +-  5         |         1561 +- 18
        1024    |         1073 +- 11         |         1897 +- 12

TEST #2: Each logical CPU executes a task that removes 1M empty files
         from its own RAMFS mount. All frees are node-local.

RESULTS:

 cache is dead? | files removal time (s)
----------------+----------------------------------
      -         |       15.57 +- 0.55   (base)
      +         |       16.80 +- 0.62   (base + 8%)

---
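
For reference, the numbers in TEST #1 come from a loop along these lines in
my test module (a rough sketch - the module itself isn't posted here, so
the names below are made up):

/* called from one kthread per logical CPU; objs holds room for nr pointers */
static s64 measure_free_time_ms(struct kmem_cache *cachep,
				void **objs, int nr)
{
	ktime_t start, end;
	int i;

	/* populate the cache first; allocation failures ignored in this sketch */
	for (i = 0; i < nr; i++)
		objs[i] = kmem_cache_alloc(cachep, GFP_KERNEL);

	/* time only the free path, which is what this series changes */
	start = ktime_get();
	for (i = 0; i < nr; i++)
		kmem_cache_free(cachep, objs[i]);
	end = ktime_get();

	return ktime_to_ms(ktime_sub(end, start));
}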

So, according to TEST #1 the relative slowdown introduced by zapping per
cpu arrays is really dreadful - it can be up to 4x! However, the
absolute numbers aren't that huge - ~1 second for 24 million objects.
If the workload does anything besides kfree, the slowdown shouldn't be
that visible IMO.

TEST #2 is an attempt to estimate how zapping of per cpu arrays will
affect FS object destruction, which is the most common use case for dead
caches. To avoid disk-bound operations it uses RAMFS. From the test
results it follows that the relative slowdown of massive file deletion is
within 2 stdev, which looks decent.

Anyway, the alternative approach (reaping dead caches periodically)
won't have this kfree slowdown at all. However, periodic reaping can
become a real disaster as the system evolves and the number of dead
caches grows. Currently I don't know how we can estimate real life
effects of this. If you have any ideas, please let me know.

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH] slab: document why cache can have no per cpu array on kfree
  2014-06-24  7:38     ` Joonsoo Kim
@ 2014-06-25 14:39       ` Vladimir Davydov
  -1 siblings, 0 replies; 50+ messages in thread
From: Vladimir Davydov @ 2014-06-25 14:39 UTC (permalink / raw)
  To: akpm
  Cc: iamjoonsoo.kim, cl, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
---
 mm/slab.c |    4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/slab.c b/mm/slab.c
index 6009e44a4d1d..4cb2619277ff 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3530,6 +3530,10 @@ static inline void __cache_free(struct kmem_cache *cachep, void *objp,
 	kmemcheck_slab_free(cachep, objp, cachep->object_size);
 
 #ifdef CONFIG_MEMCG_KMEM
+	/*
+	 * Per cpu arrays are disabled for dead memcg caches so that they do
+	 * not prevent cache self-destruction.
+	 */
 	if (unlikely(!ac)) {
 		int nodeid = page_to_nid(virt_to_page(objp));
 
-- 
1.7.10.4
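
For context, the branch this comment documents bypasses the per cpu
array and releases the object straight to the node's slab lists.
Roughly (a reconstructed sketch for illustration, not quoted verbatim
from patch 8/8 of the series):

#ifdef CONFIG_MEMCG_KMEM
	if (unlikely(!ac)) {
		int nodeid = page_to_nid(virt_to_page(objp));

		/*
		 * Dead memcg cache: free directly to the node lists under
		 * the node list_lock, so no objects linger in per cpu
		 * arrays and the last free can trigger cache destruction.
		 */
		spin_lock(&cachep->node[nodeid]->list_lock);
		free_block(cachep, &objp, 1, nodeid);
		spin_unlock(&cachep->node[nodeid]->list_lock);
		return;
	}
#endif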


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH] slab: document why cache can have no per cpu array on kfree
  2014-06-25 14:39       ` Vladimir Davydov
@ 2014-06-25 16:19         ` Christoph Lameter
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-06-25 16:19 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: akpm, iamjoonsoo.kim, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm


Acked-by: Christoph Lameter <cl@linux.com>


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches
  2014-06-25 13:45       ` Vladimir Davydov
@ 2014-06-27  6:05         ` Joonsoo Kim
  -1 siblings, 0 replies; 50+ messages in thread
From: Joonsoo Kim @ 2014-06-27  6:05 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: akpm, cl, rientjes, penberg, hannes, mhocko, linux-kernel, linux-mm

On Wed, Jun 25, 2014 at 05:45:45PM +0400, Vladimir Davydov wrote:
> On Tue, Jun 24, 2014 at 04:38:41PM +0900, Joonsoo Kim wrote:
> > On Fri, Jun 13, 2014 at 12:38:22AM +0400, Vladimir Davydov wrote:
> > And, you said that this way of implementation would be slow because
> > there could be many object in dead caches and this implementation
> > needs node spin_lock on each object freeing. Is it no problem now?
> > 
> > If you have any performance data about this implementation and
> > alternative one, could you share it?
> 
> I ran some tests on a 2 CPU x 6 core x 2 HT box. The kernel was compiled
> with a config taken from a popular distro, so it had most of debug
> options turned off.
> 
> ---
> 
> TEST #1: Each logical CPU executes a task that frees 1M objects
>          allocated from the same cache. All frees are node-local.
> 
> RESULTS:
> 
> objsize (bytes) | cache is dead? | objects free time (ms)
> ----------------+----------------+-----------------------
>           64    |       -        |       373 +- 5
>            -    |       +        |      1300 +- 6
>                 |                |
>          128    |       -        |       387 +- 6
>            -    |       +        |      1337 +- 6
>                 |                |
>          256    |       -        |       484 +- 4
>            -    |       +        |      1407 +- 6
>                 |                |
>          512    |       -        |       686 +-  5
>            -    |       +        |      1561 +- 18
>                 |                |
>         1024    |       -        |      1073 +- 11
>            -    |       +        |      1897 +- 12
> 
> TEST #2: Each logical CPU executes a task that removes 1M empty files
>          from its own RAMFS mount. All frees are node-local.
> 
> RESULTS:
> 
>  cache is dead? | files removal time (s)
> ----------------+----------------------------------
>       -         |       15.57 +- 0.55   (base)
>       +         |       16.80 +- 0.62   (base + 8%)
> 
> ---
> 
> So, according to TEST #1 the relative slowdown introduced by zapping per
> cpu arrays is really dreadful - it can be up to 4x! However, the
> absolute numbers aren't that huge - ~1 second for 24 million objects.
> If we do something else except kfree the slowdown shouldn't be that
> visible IMO.
> 
> TEST #2 is an attempt to estimate how zapping of per cpu arrays will
> affect FS objects destruction, which is the most common case of dead
> caches usage. To avoid disk-bound operations it uses RAMFS. From the
> test results it follows that the relative slowdown of massive file
> deletion is within 2 stdev, which looks decent.
> 
> Anyway, the alternative approach (reaping dead caches periodically)
> won't have this kfree slowdown at all. However, periodic reaping can
> become a real disaster as the system evolves and the number of dead
> caches grows. Currently I don't know how we can estimate real life
> effects of this. If you have any ideas, please let me know.
> 

Hello,

I have no idea here. I don't have much experience with large-scale
systems. But the current implementation would also have big trouble
if the system is larger than yours.

I think that Christoph can say something about this result.

Christoph,
Is it tolerable result for large scale system? Or do we need to find
another solution?

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches
  2014-06-27  6:05         ` Joonsoo Kim
@ 2014-06-30 15:49           ` Christoph Lameter
  -1 siblings, 0 replies; 50+ messages in thread
From: Christoph Lameter @ 2014-06-30 15:49 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vladimir Davydov, akpm, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

On Fri, 27 Jun 2014, Joonsoo Kim wrote:

> Christoph,
> Is it tolerable result for large scale system? Or do we need to find
> another solution?


The overhead is pretty intense but then this is a rare event I guess?

It seems that it is much easier on the code and much faster to do the
periodic reaping. Why not simply go with that?



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches
  2014-06-30 15:49           ` Christoph Lameter
@ 2014-07-01  7:46             ` Vladimir Davydov
  -1 siblings, 0 replies; 50+ messages in thread
From: Vladimir Davydov @ 2014-07-01  7:46 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Joonsoo Kim, akpm, rientjes, penberg, hannes, mhocko,
	linux-kernel, linux-mm

On Mon, Jun 30, 2014 at 10:49:03AM -0500, Christoph Lameter wrote:
> On Fri, 27 Jun 2014, Joonsoo Kim wrote:
> 
> > Christoph,
> > Is it tolerable result for large scale system? Or do we need to find
> > another solution?
> 
> 
> The overhead is pretty intense but then this is a rare event I guess?

Yes, provided cgroups are created/destroyed rarely.

> It seems that it is much easier on the code and much faster to do the
> periodic reaping. Why not simply go with that?

A bad thing about periodic reaping is that the time it may take isn't
predictable, because the number of dead caches is, in fact, limited
only by the amount of RAM.

We can have hundreds, if not thousands, of copies of dcaches/icaches
left over from cgroups destroyed some time ago. The dead caches will
hang around until memory pressure evicts all the objects they host,
which may take quite a long time on systems with a lot of memory.

With periodic reaping, we will have to iterate over all dead caches
on every pass, trying to drain their per cpu/node arrays, which might
therefore slow down the whole system unexpectedly.

I'm not quite sure if such slowdowns are really a threat though.
Actually, cache_reap will only do something (take locks, drain
arrays/lists) if there are free objects in the cache. Otherwise it
will, in fact, only check cpu_cache->avail, alien->avail, shared->avail,
and node->free_list, which shouldn't take much time, should it?
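
To illustrate, the per-cache check boils down to something like this
(a simplified sketch using mm/slab.c internals of that era; the helper
name is made up and the real cache_reap is structured differently, but
the checks are about this cheap):

static bool cache_has_cached_objects(struct kmem_cache *cachep, int node)
{
	struct kmem_cache_node *n = cachep->node[node];
	int nid;

	if (cpu_cache_get(cachep)->avail)	/* per cpu array */
		return true;
	if (n->shared && n->shared->avail)	/* shared array */
		return true;
	if (n->alien) {				/* alien arrays */
		for_each_online_node(nid)
			if (n->alien[nid] && n->alien[nid]->avail)
				return true;
	}
	return !list_empty(&n->slabs_free);	/* completely free slabs */
}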

Thanks.

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2014-07-01  7:46 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-12 20:38 [PATCH -mm v3 0/8] memcg/slab: reintroduce dead cache self-destruction Vladimir Davydov
2014-06-12 20:38 ` Vladimir Davydov
2014-06-12 20:38 ` [PATCH -mm v3 1/8] memcg: cleanup memcg_cache_params refcnt usage Vladimir Davydov
2014-06-12 20:38   ` Vladimir Davydov
2014-06-12 20:38 ` [PATCH -mm v3 2/8] memcg: destroy kmem caches when last slab is freed Vladimir Davydov
2014-06-12 20:38   ` Vladimir Davydov
2014-06-12 20:38 ` [PATCH -mm v3 3/8] memcg: mark caches that belong to offline memcgs as dead Vladimir Davydov
2014-06-12 20:38   ` Vladimir Davydov
2014-06-12 20:38 ` [PATCH -mm v3 4/8] slub: don't fail kmem_cache_shrink if slab placement optimization fails Vladimir Davydov
2014-06-12 20:38   ` Vladimir Davydov
2014-06-12 20:38 ` [PATCH -mm v3 5/8] slub: make slab_free non-preemptable Vladimir Davydov
2014-06-12 20:38   ` Vladimir Davydov
2014-06-12 20:38 ` [PATCH -mm v3 6/8] memcg: wait for kfree's to finish before destroying cache Vladimir Davydov
2014-06-12 20:38   ` Vladimir Davydov
2014-06-12 20:38 ` [PATCH -mm v3 7/8] slub: make dead memcg caches discard free slabs immediately Vladimir Davydov
2014-06-12 20:38   ` Vladimir Davydov
2014-06-13 16:54   ` Christoph Lameter
2014-06-13 16:54     ` Christoph Lameter
2014-06-24  7:50   ` Joonsoo Kim
2014-06-24  7:50     ` Joonsoo Kim
2014-06-24  8:25     ` Vladimir Davydov
2014-06-24  8:25       ` Vladimir Davydov
2014-06-24  9:42     ` [PATCH -mm] slub: kmem_cache_shrink: check if partial list is empty under list_lock Vladimir Davydov
2014-06-24  9:42       ` Vladimir Davydov
2014-06-12 20:38 ` [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches Vladimir Davydov
2014-06-12 20:38   ` Vladimir Davydov
2014-06-12 20:41   ` Vladimir Davydov
2014-06-12 20:41     ` Vladimir Davydov
2014-06-24  7:25   ` Joonsoo Kim
2014-06-24  7:25     ` Joonsoo Kim
2014-06-24  7:42     ` Vladimir Davydov
2014-06-24  7:42       ` Vladimir Davydov
2014-06-24 12:28     ` [PATCH -mm] slab: set free_limit for dead caches to 0 Vladimir Davydov
2014-06-24 12:28       ` Vladimir Davydov
2014-06-24  7:38   ` [PATCH -mm v3 8/8] slab: do not keep free objects/slabs on dead memcg caches Joonsoo Kim
2014-06-24  7:38     ` Joonsoo Kim
2014-06-24  7:48     ` Vladimir Davydov
2014-06-24  7:48       ` Vladimir Davydov
2014-06-25 13:45     ` Vladimir Davydov
2014-06-25 13:45       ` Vladimir Davydov
2014-06-27  6:05       ` Joonsoo Kim
2014-06-27  6:05         ` Joonsoo Kim
2014-06-30 15:49         ` Christoph Lameter
2014-06-30 15:49           ` Christoph Lameter
2014-07-01  7:46           ` Vladimir Davydov
2014-07-01  7:46             ` Vladimir Davydov
2014-06-25 14:39     ` [PATCH] slab: document why cache can have no per cpu array on kfree Vladimir Davydov
2014-06-25 14:39       ` Vladimir Davydov
2014-06-25 16:19       ` Christoph Lameter
2014-06-25 16:19         ` Christoph Lameter

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.