* [PATCHSET] slab: make memcg slab destruction scalable
@ 2017-01-14  5:54 Tejun Heo
  2017-01-14  5:54 ` [PATCH 1/9] Revert "slub: move synchronize_sched out of slab_mutex on shrink" Tejun Heo
                   ` (8 more replies)
  0 siblings, 9 replies; 26+ messages in thread
From: Tejun Heo @ 2017-01-14  5:54 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.

I've seen machines which end up with hundreds of thousands of caches
and many millions of kernfs_nodes.  The current code is O(N^2) on the
total number of caches and has synchronous rcu_barrier() and
synchronize_sched() in the cgroup offline / release path, which is
executed while holding cgroup_mutex.  Combined, this leads to very
expensive and slow cache destruction operations which can easily keep
running for half a day.
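
As a back-of-envelope illustration (assuming N ~ 10^5 caches, in line
with the numbers above): each offline / destruction operation walks
the full slab_caches list and the number of such operations scales
with N itself, so cleanup is on the order of N^2 ~ 10^10 list
iterations, on top of which every operation waits synchronously for
RCU grace periods (typically tens of milliseconds each) while holding
cgroup_mutex.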

This also messes up /proc/slabinfo along with other cache-iterating
operations.  seq_file operates on 4k chunks and on each 4k boundary
tries to seek back to the last position in the list.  With a huge
number of caches on the list, this becomes very slow and very prone to
the list content changing underneath it, leading to a lot of missing
and/or duplicate entries.
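
For reference, the iteration behind /proc/slabinfo boils down to the
following (essentially slab_start()/slab_next()/slab_stop() from
mm/slab_common.c, also visible in the patch 5 hunks below).
seq_list_start() walks the list from the head to *pos on every
->start() call, and slab_mutex is dropped between chunks, so the list
is free to change in the meantime:

  void *slab_start(struct seq_file *m, loff_t *pos)
  {
  	mutex_lock(&slab_mutex);
  	return seq_list_start(&slab_caches, *pos);	/* linear walk to *pos */
  }

  void *slab_next(struct seq_file *m, void *p, loff_t *pos)
  {
  	return seq_list_next(p, &slab_caches, pos);
  }

  void slab_stop(struct seq_file *m, void *p)
  {
  	mutex_unlock(&slab_mutex);
  }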

This patchset addresses the scalability problem.

* Separate out root and memcg cache lists and add per-memcg list.
  Update each user to use the appropriate list.

* Replace rcu_barrier() and synchronize_sched() with call_rcu() and
  call_rcu_sched().

* For dying empty slub caches, remove the sysfs files after
  deactivation so that we don't end up with millions of sysfs files
  without any useful information on them.

This patchset contains the following nine patches.

 0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
 0002-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
 0003-slab-simplify-shutdown_memcg_caches.patch
 0004-slab-reorganize-memcg_cache_params.patch
 0005-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
 0006-slab-don-t-put-memcg-caches-on-slab_caches-list.patch
 0007-slab-introduce-__kmemcg_cache_deactivate.patch
 0008-slab-remove-synchronous-synchronize_sched-from-memcg.patch
 0009-slab-remove-slub-sysfs-interface-files-early-for-emp.patch

0001 reverts an existing optimization to prepare for the following
changes.  0002 replaces rcu_barrier() in release path with call_rcu().
0003-0006 separate out the lists.  0007-0008 replace
synchronize_sched() in slub destruction path with call_rcu_sched().
0009 removes sysfs files early for empty dying caches.

This patchset is on top of the current linus#master a121103c9228 and
also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-kmemcg-scalability

diffstat follows.  Thanks.

 include/linux/memcontrol.h |    1 
 include/linux/slab.h       |   39 ++++-
 include/linux/slab_def.h   |    5 
 include/linux/slub_def.h   |    9 -
 mm/memcontrol.c            |    7 -
 mm/slab.c                  |    7 +
 mm/slab.h                  |   21 ++-
 mm/slab_common.c           |  306 ++++++++++++++++++++++++---------------------
 mm/slub.c                  |   54 +++++++
 9 files changed, 283 insertions(+), 166 deletions(-)

--
tejun

* [PATCH 1/9] Revert "slub: move synchronize_sched out of slab_mutex on shrink"
  2017-01-14  5:54 [PATCHSET] slab: make memcg slab destruction scalable Tejun Heo
@ 2017-01-14  5:54 ` Tejun Heo
  2017-01-14  5:54 ` [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path Tejun Heo
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-01-14  5:54 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

This reverts commit 89e364db71fb5e7fc8d93228152abfa67daf35fa.

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

Moving synchronize_sched() out of slab_mutex isn't enough as it's
still inside cgroup_mutex.  The whole deactivation / release path will
be updated to avoid all synchronous RCU operations.  Revert this
insufficient optimization in preparation for the changes that follow.
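
For context, with this revert applied, the path in question looks
roughly like the following, all of it under cgroup_mutex:

  cgroup offline (cgroup_mutex held)
    memcg_offline_kmem(memcg)
      memcg_deactivate_kmem_caches(memcg)
        list_for_each_entry(s, &slab_caches, list)	/* every cache */
          __kmem_cache_shrink(memcg cache, true)
            synchronize_sched()				/* slub, once per cache */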

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/slab.c        |  4 ++--
 mm/slab.h        |  2 +-
 mm/slab_common.c | 27 ++-------------------------
 mm/slob.c        |  2 +-
 mm/slub.c        | 19 +++++++++++++++++--
 5 files changed, 23 insertions(+), 31 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 29bc6c0..767e8e4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2314,7 +2314,7 @@ static int drain_freelist(struct kmem_cache *cache,
 	return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep)
+int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 {
 	int ret = 0;
 	int node;
@@ -2334,7 +2334,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-	return __kmem_cache_shrink(cachep);
+	return __kmem_cache_shrink(cachep, false);
 }
 
 void __kmem_cache_release(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index de6579d..4acc644 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -161,7 +161,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *);
+int __kmem_cache_shrink(struct kmem_cache *, bool);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ae32384..46ff746 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -579,29 +579,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 	get_online_cpus();
 	get_online_mems();
 
-#ifdef CONFIG_SLUB
-	/*
-	 * In case of SLUB, we need to disable empty slab caching to
-	 * avoid pinning the offline memory cgroup by freeable kmem
-	 * pages charged to it. SLAB doesn't need this, as it
-	 * periodically purges unused slabs.
-	 */
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list) {
-		c = is_root_cache(s) ? cache_from_memcg_idx(s, idx) : NULL;
-		if (c) {
-			c->cpu_partial = 0;
-			c->min_partial = 0;
-		}
-	}
-	mutex_unlock(&slab_mutex);
-	/*
-	 * kmem_cache->cpu_partial is checked locklessly (see
-	 * put_cpu_partial()). Make sure the change is visible.
-	 */
-	synchronize_sched();
-#endif
-
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
 		if (!is_root_cache(s))
@@ -613,7 +590,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 		if (!c)
 			continue;
 
-		__kmem_cache_shrink(c);
+		__kmem_cache_shrink(c, true);
 		arr->entries[idx] = NULL;
 	}
 	mutex_unlock(&slab_mutex);
@@ -784,7 +761,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 	get_online_cpus();
 	get_online_mems();
 	kasan_cache_shrink(cachep);
-	ret = __kmem_cache_shrink(cachep);
+	ret = __kmem_cache_shrink(cachep, false);
 	put_online_mems();
 	put_online_cpus();
 	return ret;
diff --git a/mm/slob.c b/mm/slob.c
index eac04d4..5ec1580 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -634,7 +634,7 @@ void __kmem_cache_release(struct kmem_cache *c)
 {
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d)
+int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
 {
 	return 0;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 067598a..68b84f9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3883,7 +3883,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 {
 	int node;
 	int i;
@@ -3895,6 +3895,21 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	unsigned long flags;
 	int ret = 0;
 
+	if (deactivate) {
+		/*
+		 * Disable empty slabs caching. Used to avoid pinning offline
+		 * memory cgroups by kmem pages that can be freed.
+		 */
+		s->cpu_partial = 0;
+		s->min_partial = 0;
+
+		/*
+		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
+		 * so we have to make sure the change is visible.
+		 */
+		synchronize_sched();
+	}
+
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
 		INIT_LIST_HEAD(&discard);
@@ -3951,7 +3966,7 @@ static int slab_mem_going_offline_callback(void *arg)
 
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list)
-		__kmem_cache_shrink(s);
+		__kmem_cache_shrink(s, false);
 	mutex_unlock(&slab_mutex);
 
 	return 0;
-- 
2.9.3

* [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
  2017-01-14  5:54 [PATCHSET] slab: make memcg slab destruction scalable Tejun Heo
  2017-01-14  5:54 ` [PATCH 1/9] Revert "slub: move synchronize_sched out of slab_mutex on shrink" Tejun Heo
@ 2017-01-14  5:54 ` Tejun Heo
  2017-01-14 13:19   ` Vladimir Davydov
  2017-01-14  5:54 ` [PATCH 3/9] slab: simplify shutdown_memcg_caches() Tejun Heo
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-14  5:54 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

SLAB_DESTROY_BY_RCU caches need an RCU grace period before destruction.
Currently, it's done synchronously with rcu_barrier().  As
rcu_barrier() is expensive time-wise, slab implements a batching
mechanism so that rcu_barrier() can be done for multiple caches at the
same time.

Unfortunately, the rcu_barrier() is in a synchronous path which is
called while holding cgroup_mutex, and the batching is too limited to
be actually helpful.  Besides, the batching is just a degenerate form
of the actual RCU callback mechanism.

This patch updates the cache release path so that it simply uses
call_rcu() instead of the synchronous rcu_barrier() + custom batching.
This doesn't cost more while being logically simpler and way more
scalable.

* ->rcu_head is added to kmem_cache structs.  It shares storage space
  with ->list.

* slub sysfs removal and release are separated and the former is now
  called from __kmem_cache_shutdown() while the latter is called from
  the release path.  There's no reason to defer sysfs removal through
  RCU and this makes it unnecessary to bounce to workqueue from the
  RCU callback.

* release_caches() is removed and shutdown_cache() now either directly
  releases the cache or schedules an RCU callback to do that.  This
  makes the cache inaccessible once shutdown_cache() is called and
  makes it impossible for shutdown_memcg_caches() to do memcg-specific
  cleanups afterwards.  Move the memcg-specific part into a helper,
  unlink_memcg_cache(), and make shutdown_cache() call it directly.
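
The union makes the ordering in shutdown_cache() important: the cache
has to be unlinked before call_rcu() is armed because, once the RCU
head is in use, the ->list pointers are gone.  Roughly:

	union {
		struct list_head list;		/* valid until shutdown_cache() unlinks */
		struct rcu_head rcu_head;	/* valid once call_rcu() is armed */
	};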

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/slab_def.h |  5 ++-
 include/linux/slub_def.h |  9 ++++--
 mm/slab.h                |  5 ++-
 mm/slab_common.c         | 84 ++++++++++++++++++++----------------------------
 mm/slub.c                |  9 +++++-
 5 files changed, 57 insertions(+), 55 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 4ad2c5a..b649629 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -39,7 +39,10 @@ struct kmem_cache {
 
 /* 4) cache creation/removal */
 	const char *name;
-	struct list_head list;
+	union {
+		struct list_head list;
+		struct rcu_head rcu_head;
+	};
 	int refcount;
 	int object_size;
 	int align;
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 75f56c2..7637b41 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -80,7 +80,10 @@ struct kmem_cache {
 	int align;		/* Alignment */
 	int reserved;		/* Reserved bytes at the end of slabs */
 	const char *name;	/* Name (only for display!) */
-	struct list_head list;	/* List of slab caches */
+	union {
+		struct list_head list;	/* List of slab caches */
+		struct rcu_head rcu_head;
+	};
 	int red_left_pad;	/* Left redzone padding size */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
@@ -113,9 +116,9 @@ struct kmem_cache {
 
 #ifdef CONFIG_SYSFS
 #define SLAB_SUPPORTS_SYSFS
-void sysfs_slab_remove(struct kmem_cache *);
+void sysfs_slab_release(struct kmem_cache *);
 #else
-static inline void sysfs_slab_remove(struct kmem_cache *s)
+static inline void sysfs_slab_release(struct kmem_cache *s)
 {
 }
 #endif
diff --git a/mm/slab.h b/mm/slab.h
index 4acc644..3fa2d77 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -24,7 +24,10 @@ struct kmem_cache {
 	const char *name;	/* Slab name for sysfs */
 	int refcount;		/* Use counter */
 	void (*ctor)(void *);	/* Called on object slot creation */
-	struct list_head list;	/* List of all slab caches on the system */
+	union {
+		struct list_head list;	/* List of all slab caches on the system */
+		struct rcu_head rcu_head;
+	};
 };
 
 #endif /* CONFIG_SLOB */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 46ff746..851c75e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -215,6 +215,11 @@ int memcg_update_all_caches(int num_memcgs)
 	mutex_unlock(&slab_mutex);
 	return ret;
 }
+
+static void unlink_memcg_cache(struct kmem_cache *s)
+{
+	list_del(&s->memcg_params.list);
+}
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
 		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
@@ -225,6 +230,10 @@ static inline int init_memcg_params(struct kmem_cache *s,
 static inline void destroy_memcg_params(struct kmem_cache *s)
 {
 }
+
+static inline void unlink_memcg_cache(struct kmem_cache *s)
+{
+}
 #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
 
 /*
@@ -458,33 +467,32 @@ kmem_cache_create(const char *name, size_t size, size_t align,
 }
 EXPORT_SYMBOL(kmem_cache_create);
 
-static int shutdown_cache(struct kmem_cache *s,
-		struct list_head *release, bool *need_rcu_barrier)
+static void slab_kmem_cache_release_rcufn(struct rcu_head *head)
 {
-	if (__kmem_cache_shutdown(s) != 0)
-		return -EBUSY;
+	struct kmem_cache *s = container_of(head, struct kmem_cache, rcu_head);
 
-	if (s->flags & SLAB_DESTROY_BY_RCU)
-		*need_rcu_barrier = true;
-
-	list_move(&s->list, release);
-	return 0;
+#ifdef SLAB_SUPPORTS_SYSFS
+	sysfs_slab_release(s);
+#else
+	slab_kmem_cache_release(s);
+#endif
 }
 
-static void release_caches(struct list_head *release, bool need_rcu_barrier)
+static int shutdown_cache(struct kmem_cache *s)
 {
-	struct kmem_cache *s, *s2;
+	if (__kmem_cache_shutdown(s) != 0)
+		return -EBUSY;
 
-	if (need_rcu_barrier)
-		rcu_barrier();
+	list_del(&s->list);
+	if (!is_root_cache(s))
+		unlink_memcg_cache(s);
 
-	list_for_each_entry_safe(s, s2, release, list) {
-#ifdef SLAB_SUPPORTS_SYSFS
-		sysfs_slab_remove(s);
-#else
-		slab_kmem_cache_release(s);
-#endif
-	}
+	if (s->flags & SLAB_DESTROY_BY_RCU)
+		call_rcu(&s->rcu_head, slab_kmem_cache_release_rcufn);
+	else
+		slab_kmem_cache_release_rcufn(&s->rcu_head);
+
+	return 0;
 }
 
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
@@ -599,22 +607,8 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 	put_online_cpus();
 }
 
-static int __shutdown_memcg_cache(struct kmem_cache *s,
-		struct list_head *release, bool *need_rcu_barrier)
-{
-	BUG_ON(is_root_cache(s));
-
-	if (shutdown_cache(s, release, need_rcu_barrier))
-		return -EBUSY;
-
-	list_del(&s->memcg_params.list);
-	return 0;
-}
-
 void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
 {
-	LIST_HEAD(release);
-	bool need_rcu_barrier = false;
 	struct kmem_cache *s, *s2;
 
 	get_online_cpus();
@@ -628,18 +622,15 @@ void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
 		 * The cgroup is about to be freed and therefore has no charges
 		 * left. Hence, all its caches must be empty by now.
 		 */
-		BUG_ON(__shutdown_memcg_cache(s, &release, &need_rcu_barrier));
+		BUG_ON(shutdown_cache(s));
 	}
 	mutex_unlock(&slab_mutex);
 
 	put_online_mems();
 	put_online_cpus();
-
-	release_caches(&release, need_rcu_barrier);
 }
 
-static int shutdown_memcg_caches(struct kmem_cache *s,
-		struct list_head *release, bool *need_rcu_barrier)
+static int shutdown_memcg_caches(struct kmem_cache *s)
 {
 	struct memcg_cache_array *arr;
 	struct kmem_cache *c, *c2;
@@ -658,7 +649,7 @@ static int shutdown_memcg_caches(struct kmem_cache *s,
 		c = arr->entries[i];
 		if (!c)
 			continue;
-		if (__shutdown_memcg_cache(c, release, need_rcu_barrier))
+		if (shutdown_cache(c))
 			/*
 			 * The cache still has objects. Move it to a temporary
 			 * list so as not to try to destroy it for a second
@@ -681,7 +672,7 @@ static int shutdown_memcg_caches(struct kmem_cache *s,
 	 */
 	list_for_each_entry_safe(c, c2, &s->memcg_params.list,
 				 memcg_params.list)
-		__shutdown_memcg_cache(c, release, need_rcu_barrier);
+		shutdown_cache(c);
 
 	list_splice(&busy, &s->memcg_params.list);
 
@@ -694,8 +685,7 @@ static int shutdown_memcg_caches(struct kmem_cache *s,
 	return 0;
 }
 #else
-static inline int shutdown_memcg_caches(struct kmem_cache *s,
-		struct list_head *release, bool *need_rcu_barrier)
+static inline int shutdown_memcg_caches(struct kmem_cache *s)
 {
 	return 0;
 }
@@ -711,8 +701,6 @@ void slab_kmem_cache_release(struct kmem_cache *s)
 
 void kmem_cache_destroy(struct kmem_cache *s)
 {
-	LIST_HEAD(release);
-	bool need_rcu_barrier = false;
 	int err;
 
 	if (unlikely(!s))
@@ -728,9 +716,9 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (s->refcount)
 		goto out_unlock;
 
-	err = shutdown_memcg_caches(s, &release, &need_rcu_barrier);
+	err = shutdown_memcg_caches(s);
 	if (!err)
-		err = shutdown_cache(s, &release, &need_rcu_barrier);
+		err = shutdown_cache(s);
 
 	if (err) {
 		pr_err("kmem_cache_destroy %s: Slab cache still has objects\n",
@@ -742,8 +730,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
 
 	put_online_mems();
 	put_online_cpus();
-
-	release_caches(&release, need_rcu_barrier);
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
diff --git a/mm/slub.c b/mm/slub.c
index 68b84f9..a26cb90 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -214,11 +214,13 @@ enum track_item { TRACK_ALLOC, TRACK_FREE };
 static int sysfs_slab_add(struct kmem_cache *);
 static int sysfs_slab_alias(struct kmem_cache *, const char *);
 static void memcg_propagate_slab_attrs(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *);
 #else
 static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
 static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
 							{ return 0; }
 static inline void memcg_propagate_slab_attrs(struct kmem_cache *s) { }
+static inline void sysfs_slab_remove(struct kmem_cache *) { }
 #endif
 
 static inline void stat(const struct kmem_cache *s, enum stat_item si)
@@ -3679,6 +3681,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 		if (n->nr_partial || slabs_node(s, node))
 			return 1;
 	}
+	sysfs_slab_remove(s);
 	return 0;
 }
 
@@ -5629,7 +5632,7 @@ static int sysfs_slab_add(struct kmem_cache *s)
 	goto out;
 }
 
-void sysfs_slab_remove(struct kmem_cache *s)
+static void sysfs_slab_remove(struct kmem_cache *s)
 {
 	if (slab_state < FULL)
 		/*
@@ -5643,6 +5646,10 @@ void sysfs_slab_remove(struct kmem_cache *s)
 #endif
 	kobject_uevent(&s->kobj, KOBJ_REMOVE);
 	kobject_del(&s->kobj);
+}
+
+void sysfs_slab_release(struct kmem_cache *s)
+{
 	kobject_put(&s->kobj);
 }
 
-- 
2.9.3

* [PATCH 3/9] slab: simplify shutdown_memcg_caches()
  2017-01-14  5:54 [PATCHSET] slab: make memcg slab destruction scalable Tejun Heo
  2017-01-14  5:54 ` [PATCH 1/9] Revert "slub: move synchronize_sched out of slab_mutex on shrink" Tejun Heo
  2017-01-14  5:54 ` [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path Tejun Heo
@ 2017-01-14  5:54 ` Tejun Heo
  2017-01-14 13:27   ` Vladimir Davydov
  2017-01-14  5:54 ` [PATCH 4/9] slab: reorganize memcg_cache_params Tejun Heo
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-14  5:54 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

shutdown_memcg_caches() shuts down all memcg caches associated with a
root cache.  It first walks the index table, clearing and shutting
down each entry, and then shuts down the ones on
root_cache->memcg_params.list.  As active caches are on both the table
and the list, they're stashed away from the list to avoid being shut
down twice and are spliced back later.

This is unnecessary complication.  All memcg caches are on
root_cache->memcg_params.list.  The function can simply clear the
index table and shut down all caches on the list.  There's no need to
muck with temporary stashing.

Simplify the code.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/slab_common.c | 32 +++++---------------------------
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 851c75e..45aa67c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -634,48 +634,26 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
 {
 	struct memcg_cache_array *arr;
 	struct kmem_cache *c, *c2;
-	LIST_HEAD(busy);
 	int i;
 
 	BUG_ON(!is_root_cache(s));
 
 	/*
-	 * First, shutdown active caches, i.e. caches that belong to online
-	 * memory cgroups.
+	 * First, clear the pointers to all memcg caches so that they will
+	 * never be accessed even if the root cache stays alive.
 	 */
 	arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
 					lockdep_is_held(&slab_mutex));
-	for_each_memcg_cache_index(i) {
-		c = arr->entries[i];
-		if (!c)
-			continue;
-		if (shutdown_cache(c))
-			/*
-			 * The cache still has objects. Move it to a temporary
-			 * list so as not to try to destroy it for a second
-			 * time while iterating over inactive caches below.
-			 */
-			list_move(&c->memcg_params.list, &busy);
-		else
-			/*
-			 * The cache is empty and will be destroyed soon. Clear
-			 * the pointer to it in the memcg_caches array so that
-			 * it will never be accessed even if the root cache
-			 * stays alive.
-			 */
-			arr->entries[i] = NULL;
-	}
+	for_each_memcg_cache_index(i)
+		arr->entries[i] = NULL;
 
 	/*
-	 * Second, shutdown all caches left from memory cgroups that are now
-	 * offline.
+	 * Shutdown all caches.
 	 */
 	list_for_each_entry_safe(c, c2, &s->memcg_params.list,
 				 memcg_params.list)
 		shutdown_cache(c);
 
-	list_splice(&busy, &s->memcg_params.list);
-
 	/*
 	 * A cache being destroyed must be empty. In particular, this means
 	 * that all per memcg caches attached to it must be empty too.
-- 
2.9.3

* [PATCH 4/9] slab: reorganize memcg_cache_params
  2017-01-14  5:54 [PATCHSET] slab: make memcg slab destruction scalable Tejun Heo
                   ` (2 preceding siblings ...)
  2017-01-14  5:54 ` [PATCH 3/9] slab: simplify shutdown_memcg_caches() Tejun Heo
@ 2017-01-14  5:54 ` Tejun Heo
  2017-01-14 13:30   ` Vladimir Davydov
  2017-01-14  5:54 ` [PATCH 5/9] slab: link memcg kmem_caches on their associated memory cgroup Tejun Heo
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-14  5:54 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

We're gonna change how memcg caches are iterated.  In preparation,
clean up and reorganize memcg_cache_params.

* The shared ->list is replaced by ->children in root and
  ->children_node in children.

* ->is_root_cache is removed.  Instead, ->root_cache is moved out of
  the child union and is now used by both root and children.  NULL
  indicates a root cache; non-NULL, a memcg one.

This patch doesn't cause any observable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/slab.h | 33 ++++++++++++++++++++++++---------
 mm/slab.h            |  6 +++---
 mm/slab_common.c     | 21 +++++++++++----------
 3 files changed, 38 insertions(+), 22 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12b..b310173 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -545,22 +545,37 @@ struct memcg_cache_array {
  * array to be accessed without taking any locks, on relocation we free the old
  * version only after a grace period.
  *
- * Child caches will hold extra metadata needed for its operation. Fields are:
+ * Root and child caches hold different metadata.
  *
- * @memcg: pointer to the memcg this cache belongs to
- * @root_cache: pointer to the global, root cache, this cache was derived from
+ * @root_cache:	Common to root and child caches.  NULL for root, pointer to
+ *		the root cache for children.
  *
- * Both root and child caches of the same kind are linked into a list chained
- * through @list.
+ * The following fields are specific to root caches.
+ *
+ * @memcg_caches: kmemcg ID indexed table of child caches.  This table is
+ *		used to index child caches during allocation and cleared
+ *		early during shutdown.
+ *
+ * @children:	List of all child caches.  While the child caches are also
+ *		reachable through @memcg_caches, a child cache remains on
+ *		this list until it is actually destroyed.
+ *
+ * The following fields are specific to child caches.
+ *
+ * @memcg:	Pointer to the memcg this cache belongs to.
+ *
+ * @children_node: List node for @root_cache->children list.
  */
 struct memcg_cache_params {
-	bool is_root_cache;
-	struct list_head list;
+	struct kmem_cache *root_cache;
 	union {
-		struct memcg_cache_array __rcu *memcg_caches;
+		struct {
+			struct memcg_cache_array __rcu *memcg_caches;
+			struct list_head children;
+		};
 		struct {
 			struct mem_cgroup *memcg;
-			struct kmem_cache *root_cache;
+			struct list_head children_node;
 		};
 	};
 };
diff --git a/mm/slab.h b/mm/slab.h
index 3fa2d77..9b3ec3f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -203,12 +203,12 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
  * slab_mutex.
  */
 #define for_each_memcg_cache(iter, root) \
-	list_for_each_entry(iter, &(root)->memcg_params.list, \
-			    memcg_params.list)
+	list_for_each_entry(iter, &(root)->memcg_params.children, \
+			    memcg_params.children_node)
 
 static inline bool is_root_cache(struct kmem_cache *s)
 {
-	return s->memcg_params.is_root_cache;
+	return !s->memcg_params.root_cache;
 }
 
 static inline bool slab_equal_or_root(struct kmem_cache *s,
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 45aa67c..fdf5b6d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -135,9 +135,9 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
 void slab_init_memcg_params(struct kmem_cache *s)
 {
-	s->memcg_params.is_root_cache = true;
-	INIT_LIST_HEAD(&s->memcg_params.list);
+	s->memcg_params.root_cache = NULL;
 	RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
+	INIT_LIST_HEAD(&s->memcg_params.children);
 }
 
 static int init_memcg_params(struct kmem_cache *s,
@@ -145,10 +145,10 @@ static int init_memcg_params(struct kmem_cache *s,
 {
 	struct memcg_cache_array *arr;
 
-	if (memcg) {
-		s->memcg_params.is_root_cache = false;
-		s->memcg_params.memcg = memcg;
+	if (root_cache) {
 		s->memcg_params.root_cache = root_cache;
+		s->memcg_params.memcg = memcg;
+		INIT_LIST_HEAD(&s->memcg_params.children_node);
 		return 0;
 	}
 
@@ -218,7 +218,7 @@ int memcg_update_all_caches(int num_memcgs)
 
 static void unlink_memcg_cache(struct kmem_cache *s)
 {
-	list_del(&s->memcg_params.list);
+	list_del(&s->memcg_params.children_node);
 }
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
@@ -559,7 +559,8 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 		goto out_unlock;
 	}
 
-	list_add(&s->memcg_params.list, &root_cache->memcg_params.list);
+	list_add(&s->memcg_params.children_node,
+		 &root_cache->memcg_params.children);
 
 	/*
 	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
@@ -650,15 +651,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
 	/*
 	 * Shutdown all caches.
 	 */
-	list_for_each_entry_safe(c, c2, &s->memcg_params.list,
-				 memcg_params.list)
+	list_for_each_entry_safe(c, c2, &s->memcg_params.children,
+				 memcg_params.children_node)
 		shutdown_cache(c);
 
 	/*
 	 * A cache being destroyed must be empty. In particular, this means
 	 * that all per memcg caches attached to it must be empty too.
 	 */
-	if (!list_empty(&s->memcg_params.list))
+	if (!list_empty(&s->memcg_params.children))
 		return -EBUSY;
 	return 0;
 }
-- 
2.9.3

* [PATCH 5/9] slab: link memcg kmem_caches on their associated memory cgroup
  2017-01-14  5:54 [PATCHSET] slab: make memcg slab destruction scalable Tejun Heo
                   ` (3 preceding siblings ...)
  2017-01-14  5:54 ` [PATCH 4/9] slab: reorganize memcg_cache_params Tejun Heo
@ 2017-01-14  5:54 ` Tejun Heo
  2017-01-14 13:33   ` Vladimir Davydov
  2017-01-14  5:54 ` [PATCH 6/9] slab: don't put memcg caches on slab_caches list Tejun Heo
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-14  5:54 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

While a memcg kmem_cache is listed on its root cache's ->children
list, there is no direct way to iterate all kmem_caches which are
associated with a memory cgroup.  The only way to iterate them is
walking all caches while filtering out caches which don't match, which
would be most of them.

This makes memcg destruction operations O(N^2) where N is the total
number of slab caches which can be huge.  This combined with the
synchronous RCU operations can tie up a CPU and affect the whole
machine for many hours when memory reclaim triggers offlining and
destruction of the stale memcgs.

This patch adds mem_cgroup->kmem_caches list which goes through
memcg_cache_params->kmem_caches_node of all kmem_caches which are
associated with the memcg.  All memcg specific iterations, including
stat file access, are updated to use the new list instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h |  1 +
 include/linux/slab.h       |  3 +++
 mm/memcontrol.c            |  7 ++++---
 mm/slab.h                  |  6 +++---
 mm/slab_common.c           | 42 ++++++++++++++++++++++++++++++++----------
 5 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 61d20c1..4de925c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -253,6 +253,7 @@ struct mem_cgroup {
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
+	struct list_head kmem_caches;
 #endif
 
 	int last_scanned_node;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index b310173..54ec959 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -565,6 +565,8 @@ struct memcg_cache_array {
  * @memcg:	Pointer to the memcg this cache belongs to.
  *
  * @children_node: List node for @root_cache->children list.
+ *
+ * @kmem_caches_node: List node for @memcg->kmem_caches list.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
@@ -576,6 +578,7 @@ struct memcg_cache_params {
 		struct {
 			struct mem_cgroup *memcg;
 			struct list_head children_node;
+			struct list_head kmem_caches_node;
 		};
 	};
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4048897..a2b20f7f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2839,6 +2839,7 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	 */
 	memcg->kmemcg_id = memcg_id;
 	memcg->kmem_state = KMEM_ONLINE;
+	INIT_LIST_HEAD(&memcg->kmem_caches);
 
 	return 0;
 }
@@ -4004,9 +4005,9 @@ static struct cftype mem_cgroup_legacy_files[] = {
 #ifdef CONFIG_SLABINFO
 	{
 		.name = "kmem.slabinfo",
-		.seq_start = slab_start,
-		.seq_next = slab_next,
-		.seq_stop = slab_stop,
+		.seq_start = memcg_slab_start,
+		.seq_next = memcg_slab_next,
+		.seq_stop = memcg_slab_stop,
 		.seq_show = memcg_slab_show,
 	},
 #endif
diff --git a/mm/slab.h b/mm/slab.h
index 9b3ec3f..b5e0040 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -488,9 +488,9 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 
 #endif
 
-void *slab_start(struct seq_file *m, loff_t *pos);
-void *slab_next(struct seq_file *m, void *p, loff_t *pos);
-void slab_stop(struct seq_file *m, void *p);
+void *memcg_slab_start(struct seq_file *m, loff_t *pos);
+void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos);
+void memcg_slab_stop(struct seq_file *m, void *p);
 int memcg_slab_show(struct seq_file *m, void *p);
 
 void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index fdf5b6d..74c36d8 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -149,6 +149,7 @@ static int init_memcg_params(struct kmem_cache *s,
 		s->memcg_params.root_cache = root_cache;
 		s->memcg_params.memcg = memcg;
 		INIT_LIST_HEAD(&s->memcg_params.children_node);
+		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
 		return 0;
 	}
 
@@ -219,6 +220,7 @@ int memcg_update_all_caches(int num_memcgs)
 static void unlink_memcg_cache(struct kmem_cache *s)
 {
 	list_del(&s->memcg_params.children_node);
+	list_del(&s->memcg_params.kmem_caches_node);
 }
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
@@ -561,6 +563,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 
 	list_add(&s->memcg_params.children_node,
 		 &root_cache->memcg_params.children);
+	list_add(&s->memcg_params.kmem_caches_node, &memcg->kmem_caches);
 
 	/*
 	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
@@ -616,9 +619,8 @@ void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry_safe(s, s2, &slab_caches, list) {
-		if (is_root_cache(s) || s->memcg_params.memcg != memcg)
-			continue;
+	list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
+				 memcg_params.kmem_caches_node) {
 		/*
 		 * The cgroup is about to be freed and therefore has no charges
 		 * left. Hence, all its caches must be empty by now.
@@ -1077,18 +1079,18 @@ static void print_slabinfo_header(struct seq_file *m)
 	seq_putc(m, '\n');
 }
 
-void *slab_start(struct seq_file *m, loff_t *pos)
+static void *slab_start(struct seq_file *m, loff_t *pos)
 {
 	mutex_lock(&slab_mutex);
 	return seq_list_start(&slab_caches, *pos);
 }
 
-void *slab_next(struct seq_file *m, void *p, loff_t *pos)
+static void *slab_next(struct seq_file *m, void *p, loff_t *pos)
 {
 	return seq_list_next(p, &slab_caches, pos);
 }
 
-void slab_stop(struct seq_file *m, void *p)
+static void slab_stop(struct seq_file *m, void *p)
 {
 	mutex_unlock(&slab_mutex);
 }
@@ -1147,15 +1149,35 @@ static int slab_show(struct seq_file *m, void *p)
 }
 
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+void *memcg_slab_start(struct seq_file *m, loff_t *pos)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	mutex_lock(&slab_mutex);
+	return seq_list_start(&memcg->kmem_caches, *pos);
+}
+
+void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	return seq_list_next(p, &memcg->kmem_caches, pos);
+}
+
+void memcg_slab_stop(struct seq_file *m, void *p)
+{
+	mutex_unlock(&slab_mutex);
+}
+
 int memcg_slab_show(struct seq_file *m, void *p)
 {
-	struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
+	struct kmem_cache *s = list_entry(p, struct kmem_cache,
+					  memcg_params.kmem_caches_node);
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
 
-	if (p == slab_caches.next)
+	if (p == memcg->kmem_caches.next)
 		print_slabinfo_header(m);
-	if (!is_root_cache(s) && s->memcg_params.memcg == memcg)
-		cache_show(s, m);
+	cache_show(s, m);
 	return 0;
 }
 #endif
-- 
2.9.3

* [PATCH 6/9] slab: don't put memcg caches on slab_caches list
  2017-01-14  5:54 [PATCHSET] slab: make memcg slab destruction scalable Tejun Heo
                   ` (4 preceding siblings ...)
  2017-01-14  5:54 ` [PATCH 5/9] slab: link memcg kmem_caches on their associated memory cgroup Tejun Heo
@ 2017-01-14  5:54 ` Tejun Heo
  2017-01-14 13:39   ` Vladimir Davydov
  2017-01-14  5:54 ` [PATCH 7/9] slab: introduce __kmemcg_cache_deactivate() Tejun Heo
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-14  5:54 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

slab_caches currently lists all caches including root and memcg ones.
This is the only data structure which lists the root caches and
iterating root caches can only be done by walking the list while
skipping over memcg caches.  As there can be a huge number of memcg
caches, this can become very expensive.

This also can make /proc/slabinfo behave very badly.  seq_file
processes reads in 4k chunks and, to resume after each chunk, seeks
back to the Nth position on the slab_caches list.  With a lot of memcg
cache churn on the list, reading /proc/slabinfo can become very slow
and its content often ends up with duplicate and/or missing entries.

As the previous patch made it unnecessary to walk slab_caches to
iterate memcg-specific caches, there is no reason to keep memcg caches
on the list.  This patch makes slab_caches include only the root
caches.  As this makes kmem_cache->list unused for memcg caches,
->memcg_params.children_node is removed and ->list is used instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/slab.h |  3 ---
 mm/slab.h            |  3 +--
 mm/slab_common.c     | 58 +++++++++++++++++++++++++---------------------------
 3 files changed, 29 insertions(+), 35 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 54ec959..63d543d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -564,8 +564,6 @@ struct memcg_cache_array {
  *
  * @memcg:	Pointer to the memcg this cache belongs to.
  *
- * @children_node: List node for @root_cache->children list.
- *
  * @kmem_caches_node: List node for @memcg->kmem_caches list.
  */
 struct memcg_cache_params {
@@ -577,7 +575,6 @@ struct memcg_cache_params {
 		};
 		struct {
 			struct mem_cgroup *memcg;
-			struct list_head children_node;
 			struct list_head kmem_caches_node;
 		};
 	};
diff --git a/mm/slab.h b/mm/slab.h
index b5e0040..8f47a44 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -203,8 +203,7 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
  * slab_mutex.
  */
 #define for_each_memcg_cache(iter, root) \
-	list_for_each_entry(iter, &(root)->memcg_params.children, \
-			    memcg_params.children_node)
+	list_for_each_entry(iter, &(root)->memcg_params.children, list)
 
 static inline bool is_root_cache(struct kmem_cache *s)
 {
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 74c36d8..c0d0126 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -68,6 +68,22 @@ unsigned int kmem_cache_size(struct kmem_cache *s)
 EXPORT_SYMBOL(kmem_cache_size);
 
 #ifdef CONFIG_DEBUG_VM
+static void kmem_cache_verify_name(struct kmem_cache *s)
+{
+	char tmp;
+	int res;
+
+	/*
+	 * This happens when the module gets unloaded and doesn't destroy
+	 * its slab cache and no-one else reuses the vmalloc area of the
+	 * module.  Print a warning.
+	 */
+	res = probe_kernel_address(s->name, tmp);
+	if (res)
+		pr_err("Slab cache with size %d has lost its name\n",
+		       s->object_size);
+}
+
 static int kmem_cache_sanity_check(const char *name, size_t size)
 {
 	struct kmem_cache *s = NULL;
@@ -79,20 +95,12 @@ static int kmem_cache_sanity_check(const char *name, size_t size)
 	}
 
 	list_for_each_entry(s, &slab_caches, list) {
-		char tmp;
-		int res;
+		struct kmem_cache *c;
 
-		/*
-		 * This happens when the module gets unloaded and doesn't
-		 * destroy its slab cache and no-one else reuses the vmalloc
-		 * area of the module.  Print a warning.
-		 */
-		res = probe_kernel_address(s->name, tmp);
-		if (res) {
-			pr_err("Slab cache with size %d has lost its name\n",
-			       s->object_size);
-			continue;
-		}
+		kmem_cache_verify_name(s);
+
+		for_each_memcg_cache(c, s)
+			kmem_cache_verify_name(c);
 	}
 
 	WARN_ON(strchr(name, ' '));	/* It confuses parsers */
@@ -148,7 +156,6 @@ static int init_memcg_params(struct kmem_cache *s,
 	if (root_cache) {
 		s->memcg_params.root_cache = root_cache;
 		s->memcg_params.memcg = memcg;
-		INIT_LIST_HEAD(&s->memcg_params.children_node);
 		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
 		return 0;
 	}
@@ -178,9 +185,6 @@ static int update_memcg_params(struct kmem_cache *s, int new_array_size)
 {
 	struct memcg_cache_array *old, *new;
 
-	if (!is_root_cache(s))
-		return 0;
-
 	new = kzalloc(sizeof(struct memcg_cache_array) +
 		      new_array_size * sizeof(void *), GFP_KERNEL);
 	if (!new)
@@ -219,7 +223,6 @@ int memcg_update_all_caches(int num_memcgs)
 
 static void unlink_memcg_cache(struct kmem_cache *s)
 {
-	list_del(&s->memcg_params.children_node);
 	list_del(&s->memcg_params.kmem_caches_node);
 }
 #else
@@ -243,10 +246,10 @@ static inline void unlink_memcg_cache(struct kmem_cache *s)
  */
 int slab_unmergeable(struct kmem_cache *s)
 {
-	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
+	if (!is_root_cache(s))
 		return 1;
 
-	if (!is_root_cache(s))
+	if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
 		return 1;
 
 	if (s->ctor)
@@ -360,7 +363,8 @@ static struct kmem_cache *create_cache(const char *name,
 		goto out_free_cache;
 
 	s->refcount = 1;
-	list_add(&s->list, &slab_caches);
+	if (is_root_cache(s))
+		list_add(&s->list, &slab_caches);
 out:
 	if (err)
 		return ERR_PTR(err);
@@ -561,8 +565,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 		goto out_unlock;
 	}
 
-	list_add(&s->memcg_params.children_node,
-		 &root_cache->memcg_params.children);
+	list_add(&s->list, &root_cache->memcg_params.children);
 	list_add(&s->memcg_params.kmem_caches_node, &memcg->kmem_caches);
 
 	/*
@@ -593,9 +596,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
-		if (!is_root_cache(s))
-			continue;
-
 		arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
 						lockdep_is_held(&slab_mutex));
 		c = arr->entries[idx];
@@ -653,8 +653,7 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
 	/*
 	 * Shutdown all caches.
 	 */
-	list_for_each_entry_safe(c, c2, &s->memcg_params.children,
-				 memcg_params.children_node)
+	list_for_each_entry_safe(c, c2, &s->memcg_params.children, list)
 		shutdown_cache(c);
 
 	/*
@@ -1143,8 +1142,7 @@ static int slab_show(struct seq_file *m, void *p)
 
 	if (p == slab_caches.next)
 		print_slabinfo_header(m);
-	if (is_root_cache(s))
-		cache_show(s, m);
+	cache_show(s, m);
 	return 0;
 }
 
-- 
2.9.3

* [PATCH 7/9] slab: introduce __kmemcg_cache_deactivate()
  2017-01-14  5:54 [PATCHSET] slab: make memcg slab destruction scalable Tejun Heo
                   ` (5 preceding siblings ...)
  2017-01-14  5:54 ` [PATCH 6/9] slab: don't put memcg caches on slab_caches list Tejun Heo
@ 2017-01-14  5:54 ` Tejun Heo
  2017-01-14 13:42   ` Vladimir Davydov
  2017-01-14  5:54 ` [PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path Tejun Heo
  2017-01-14  5:54 ` [PATCH 9/9] slab: remove slub sysfs interface files early for empty memcg caches Tejun Heo
  8 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-14  5:54 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

__kmem_cache_shrink() is called with %true @deactivate only for memcg
caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
__kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
should implement it and it should deactivate and drain the cache.

This is to allow memcg cache deactivation behavior to further deviate
from simple shrinking without messing up __kmem_cache_shrink().

This is pure reorganization and doesn't introduce any observable
behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/slab.c        | 11 +++++++++--
 mm/slab.h        |  5 ++++-
 mm/slab_common.c |  4 ++--
 mm/slob.c        |  2 +-
 mm/slub.c        | 39 ++++++++++++++++++++++-----------------
 5 files changed, 38 insertions(+), 23 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 767e8e4..65814f2 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2314,7 +2314,7 @@ static int drain_freelist(struct kmem_cache *cache,
 	return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *cachep)
 {
 	int ret = 0;
 	int node;
@@ -2332,9 +2332,16 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 	return (ret ? 1 : 0);
 }
 
+#ifdef CONFIG_MEMCG
+void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
+{
+	__kmem_cache_shrink(cachep);
+}
+#endif
+
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-	return __kmem_cache_shrink(cachep, false);
+	return __kmem_cache_shrink(cachep);
 }
 
 void __kmem_cache_release(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index 8f47a44..73ed6b5 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -164,7 +164,10 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *, bool);
+int __kmem_cache_shrink(struct kmem_cache *);
+#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+void __kmemcg_cache_deactivate(struct kmem_cache *s);
+#endif
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index c0d0126..87e5535 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -602,7 +602,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 		if (!c)
 			continue;
 
-		__kmem_cache_shrink(c, true);
+		__kmemcg_cache_deactivate(c);
 		arr->entries[idx] = NULL;
 	}
 	mutex_unlock(&slab_mutex);
@@ -727,7 +727,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 	get_online_cpus();
 	get_online_mems();
 	kasan_cache_shrink(cachep);
-	ret = __kmem_cache_shrink(cachep, false);
+	ret = __kmem_cache_shrink(cachep);
 	put_online_mems();
 	put_online_cpus();
 	return ret;
diff --git a/mm/slob.c b/mm/slob.c
index 5ec1580..eac04d4 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -634,7 +634,7 @@ void __kmem_cache_release(struct kmem_cache *c)
 {
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *d)
 {
 	return 0;
 }
diff --git a/mm/slub.c b/mm/slub.c
index a26cb90..ef89a07 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3886,7 +3886,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *s)
 {
 	int node;
 	int i;
@@ -3898,21 +3898,6 @@ int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 	unsigned long flags;
 	int ret = 0;
 
-	if (deactivate) {
-		/*
-		 * Disable empty slabs caching. Used to avoid pinning offline
-		 * memory cgroups by kmem pages that can be freed.
-		 */
-		s->cpu_partial = 0;
-		s->min_partial = 0;
-
-		/*
-		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
-		 * so we have to make sure the change is visible.
-		 */
-		synchronize_sched();
-	}
-
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
 		INIT_LIST_HEAD(&discard);
@@ -3963,13 +3948,33 @@ int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 	return ret;
 }
 
+#ifdef CONFIG_MEMCG
+void __kmemcg_cache_deactivate(struct kmem_cache *s)
+{
+	/*
+	 * Disable empty slabs caching. Used to avoid pinning offline
+	 * memory cgroups by kmem pages that can be freed.
+	 */
+	s->cpu_partial = 0;
+	s->min_partial = 0;
+
+	/*
+	 * s->cpu_partial is checked locklessly (see put_cpu_partial), so
+	 * we have to make sure the change is visible.
+	 */
+	synchronize_sched();
+
+	__kmem_cache_shrink(s);
+}
+#endif
+
 static int slab_mem_going_offline_callback(void *arg)
 {
 	struct kmem_cache *s;
 
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list)
-		__kmem_cache_shrink(s, false);
+		__kmem_cache_shrink(s);
 	mutex_unlock(&slab_mutex);
 
 	return 0;
-- 
2.9.3

* [PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path
  2017-01-14  5:54 [PATCHSET] slab: make memcg slab destruction scalable Tejun Heo
                   ` (6 preceding siblings ...)
  2017-01-14  5:54 ` [PATCH 7/9] slab: introduce __kmemcg_cache_deactivate() Tejun Heo
@ 2017-01-14  5:54 ` Tejun Heo
  2017-01-14 13:57   ` Vladimir Davydov
  2017-01-14  5:54 ` [PATCH 9/9] slab: remove slub sysfs interface files early for empty memcg caches Tejun Heo
  8 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-14  5:54 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

slub uses synchronize_sched() to deactivate a memcg cache.
synchronize_sched() is an expensive and slow operation and doesn't
scale when a huge number of caches are destroyed back-to-back.  While
there used to be a simple batching mechanism, the batching was too
restricted to be helpful.

This patch implements slab_deactivate_memcg_cache_rcu_sched() which
slub can use to schedule a sched RCU callback instead of performing
synchronize_sched() synchronously while holding cgroup_mutex.  While
this adds online cpus, mems and slab_mutex operations, taking these
locks back-to-back from the same kworker, which is what happens when
there are many caches to deactivate, isn't expensive at all, and this
gets rid of the scalability problem completely.
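
Schematically, deactivation of one memcg cache now goes through the
following stages (condensed from the hunks below):

  __kmemcg_cache_deactivate(s)			/* cgroup offline, cgroup_mutex held */
    slab_deactivate_memcg_cache_rcu_sched(s, deact_fn)
      css_get(&s->memcg_params.memcg->css)	/* keep @s alive */
      call_rcu_sched(&s->memcg_params.deact_rcu_head, ...)
						/* returns immediately */

  kmemcg_deactivate_rcufn()			/* after a sched RCU grace period */
    INIT_WORK() + schedule_work()		/* can't take blocking locks here */

  kmemcg_deactivate_workfn()			/* kworker context */
    get_online_cpus() / get_online_mems() / mutex_lock(&slab_mutex)
    s->memcg_params.deact_fn(s)			/* e.g. slub's shrink */
    unlock everything, then css_put(&s->memcg_params.memcg->css)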

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/slab.h |  6 ++++++
 mm/slab.h            |  2 ++
 mm/slab_common.c     | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/slub.c            | 12 +++++++----
 4 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 63d543d..84701bb 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -576,6 +576,12 @@ struct memcg_cache_params {
 		struct {
 			struct mem_cgroup *memcg;
 			struct list_head kmem_caches_node;
+
+			void (*deact_fn)(struct kmem_cache *);
+			union {
+				struct rcu_head deact_rcu_head;
+				struct work_struct deact_work;
+			};
 		};
 	};
 };
diff --git a/mm/slab.h b/mm/slab.h
index 73ed6b5..b67ac8f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -299,6 +299,8 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
+extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+				void (*deact_fn)(struct kmem_cache *));
 
 #else /* CONFIG_MEMCG && !CONFIG_SLOB */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 87e5535..4730ef6 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -583,6 +583,66 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }
 
+static void kmemcg_deactivate_workfn(struct work_struct *work)
+{
+	struct kmem_cache *s = container_of(work, struct kmem_cache,
+					    memcg_params.deact_work);
+
+	get_online_cpus();
+	get_online_mems();
+
+	mutex_lock(&slab_mutex);
+
+	s->memcg_params.deact_fn(s);
+
+	mutex_unlock(&slab_mutex);
+
+	put_online_mems();
+	put_online_cpus();
+
+	/* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
+	css_put(&s->memcg_params.memcg->css);
+}
+
+static void kmemcg_deactivate_rcufn(struct rcu_head *head)
+{
+	struct kmem_cache *s = container_of(head, struct kmem_cache,
+					    memcg_params.deact_rcu_head);
+
+	/*
+	 * We need to grab blocking locks.  Bounce to ->deact_work.  The
+	 * work item shares the space with the RCU head and can't be
+	 * initialized earlier.
+	 */
+	INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
+	schedule_work(&s->memcg_params.deact_work);
+}
+
+/**
+ * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
+ *					   sched RCU grace period
+ * @s: target kmem_cache
+ * @deact_fn: deactivation function to call
+ *
+ * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
+ * held after a sched RCU grace period.  The slab is guaranteed to stay
+ * alive until @deact_fn is finished.  This is to be used from
+ * __kmemcg_cache_deactivate().
+ */
+void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+					   void (*deact_fn)(struct kmem_cache *))
+{
+	if (WARN_ON_ONCE(is_root_cache(s)) ||
+	    WARN_ON_ONCE(s->memcg_params.deact_fn))
+		return;
+
+	/* pin memcg so that @s doesn't get destroyed in the middle */
+	css_get(&s->memcg_params.memcg->css);
+
+	s->memcg_params.deact_fn = deact_fn;
+	call_rcu_sched(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn);
+}
+
 void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 {
 	int idx;
diff --git a/mm/slub.c b/mm/slub.c
index ef89a07..8621940 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3949,6 +3949,12 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 }
 
 #ifdef CONFIG_MEMCG
+static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
+{
+	/* called with all the locks held after a sched RCU grace period */
+	__kmem_cache_shrink(s);
+}
+
 void __kmemcg_cache_deactivate(struct kmem_cache *s)
 {
 	/*
@@ -3960,11 +3966,9 @@ void __kmemcg_cache_deactivate(struct kmem_cache *s)
 
 	/*
 	 * s->cpu_partial is checked locklessly (see put_cpu_partial), so
-	 * we have to make sure the change is visible.
+	 * we have to make sure the change is visible before shrinking.
 	 */
-	synchronize_sched();
-
-	__kmem_cache_shrink(s);
+	slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
 }
 #endif
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 9/9] slab: remove slub sysfs interface files early for empty memcg caches
  2017-01-14  5:54 [PATCHSET] slab: make memcg slab destruction scalable Tejun Heo
                   ` (7 preceding siblings ...)
  2017-01-14  5:54 ` [PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path Tejun Heo
@ 2017-01-14  5:54 ` Tejun Heo
  2017-01-14 14:00   ` Vladimir Davydov
  8 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-14  5:54 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

Each cache has a number of sysfs interface files under
/sys/kernel/slab.  On a system with a lot of memory and transient
memcgs, the number of interface files which have to be removed once
memory reclaim kicks in can reach millions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/slub.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 8621940..41a3da7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3951,8 +3951,20 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 #ifdef CONFIG_MEMCG
 static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
 {
-	/* called with all the locks held after a sched RCU grace period */
-	__kmem_cache_shrink(s);
+	/*
+	 * Called with all the locks held after a sched RCU grace period.
+	 * Even if @s becomes empty after shrinking, we can't know that @s
+	 * doesn't have allocations already in-flight and thus can't
+	 * destroy @s until the associated memcg is released.
+	 *
+	 * However, let's remove the sysfs files for empty caches here.
+	 * Each cache has a lot of interface files which aren't
+	 * particularly useful for empty draining caches; otherwise, we can
+	 * easily end up with millions of unnecessary sysfs files on
+	 * systems which have a lot of memory and transient cgroups.
+	 */
+	if (!__kmem_cache_shrink(s))
+		sysfs_slab_remove(s);
 }
 
 void __kmemcg_cache_deactivate(struct kmem_cache *s)
@@ -5650,6 +5662,15 @@ static void sysfs_slab_remove(struct kmem_cache *s)
 		 */
 		return;
 
+	if (!s->kobj.state_in_sysfs)
+		/*
+		 * For a memcg cache, this may be called during
+		 * deactivation and again on shutdown.  Remove only once.
+		 * A cache is never shut down before deactivation is
+		 * complete, so no need to worry about synchronization.
+		 */
+		return;
+
 #ifdef CONFIG_MEMCG
 	kset_unregister(s->memcg_kset);
 #endif
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
  2017-01-14  5:54 ` [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path Tejun Heo
@ 2017-01-14 13:19   ` Vladimir Davydov
  2017-01-14 15:19     ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Vladimir Davydov @ 2017-01-14 13:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

Hello Tejun,

Thanks a lot for looking into this issue as it seems to affect a lot of
users!

On Sat, Jan 14, 2017 at 12:54:42AM -0500, Tejun Heo wrote:
> This patch updates the cache release path so that it simply uses
> call_rcu() instead of the synchronous rcu_barrier() + custom batching.
> This doesn't cost more while being logically simpler and way more
> scalable.

The point of rcu_barrier() is to wait until all rcu calls freeing slabs
from the cache being destroyed are over (rcu_free_slab, kmem_rcu_free).
I'm not sure if call_rcu() guarantees that for all rcu implementations
too. If it did, why would we need rcu_barrier() at all?
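
The callbacks in question look roughly like this (simplified from
slab's kmem_rcu_free(); slub's rcu_free_slab() is equivalent for this
purpose - illustration only, not code from the patchset):

static void kmem_rcu_free(struct rcu_head *head)
{
        /* every slab page of a SLAB_DESTROY_BY_RCU cache goes through
         * a callback like this after a grace period */
        struct page *page = container_of(head, struct page, rcu_head);

        /* still dereferences the kmem_cache via page->slab_cache */
        kmem_freepages(page->slab_cache, page);
}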

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/9] slab: simplify shutdown_memcg_caches()
  2017-01-14  5:54 ` [PATCH 3/9] slab: simplify shutdown_memcg_caches() Tejun Heo
@ 2017-01-14 13:27   ` Vladimir Davydov
  2017-01-14 15:38     ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Vladimir Davydov @ 2017-01-14 13:27 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 12:54:43AM -0500, Tejun Heo wrote:
> shutdown_memcg_caches() shuts down all memcg caches associated with a
> root cache.  It first walks the index table clearing and shutting down
> each entry and then shuts down the ones on
> root_cache->memcg_params.list.  As active caches are on both the table
> and the list, they're stashed away from the list to avoid shutting
> down twice and then get spliced back later.
> 
> This is unnecessarily complicated.  All memcg caches are on
> root_cache->memcg_params.list.  The function can simply clear the
> index table and shut down all caches on the list.  There's no need to
> muck with temporary stashing.
> 
> Simplify the code.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  mm/slab_common.c | 32 +++++---------------------------
>  1 file changed, 5 insertions(+), 27 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 851c75e..45aa67c 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -634,48 +634,26 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>  {
>  	struct memcg_cache_array *arr;
>  	struct kmem_cache *c, *c2;
> -	LIST_HEAD(busy);
>  	int i;
>  
>  	BUG_ON(!is_root_cache(s));
>  
>  	/*
> -	 * First, shutdown active caches, i.e. caches that belong to online
> -	 * memory cgroups.
> +	 * First, clear the pointers to all memcg caches so that they will
> +	 * never be accessed even if the root cache stays alive.
>  	 */
>  	arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
>  					lockdep_is_held(&slab_mutex));
> -	for_each_memcg_cache_index(i) {
> -		c = arr->entries[i];
> -		if (!c)
> -			continue;
> -		if (shutdown_cache(c))
> -			/*
> -			 * The cache still has objects. Move it to a temporary
> -			 * list so as not to try to destroy it for a second
> -			 * time while iterating over inactive caches below.
> -			 */
> -			list_move(&c->memcg_params.list, &busy);
> -		else
> -			/*
> -			 * The cache is empty and will be destroyed soon. Clear
> -			 * the pointer to it in the memcg_caches array so that
> -			 * it will never be accessed even if the root cache
> -			 * stays alive.
> -			 */
> -			arr->entries[i] = NULL;
> -	}
> +	for_each_memcg_cache_index(i)
> +		arr->entries[i] = NULL;
>  
>  	/*
> -	 * Second, shutdown all caches left from memory cgroups that are now
> -	 * offline.
> +	 * Shutdown all caches.
>  	 */
>  	list_for_each_entry_safe(c, c2, &s->memcg_params.list,
>  				 memcg_params.list)
>  		shutdown_cache(c);

The point of this complexity was to leave on the list caches that
happen to have objects when kmem_cache_destroy() is called, so that
they could be reused later. This behavior was inherited from the global
caches - if kmem_cache_destroy() is called on a cache that still has
objects, we print a warning message and don't destroy the cache. This
patch changes this behavior.

>  
> -	list_splice(&busy, &s->memcg_params.list);
> -
>  	/*
>  	 * A cache being destroyed must be empty. In particular, this means
>  	 * that all per memcg caches attached to it must be empty too.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 4/9] slab: reorganize memcg_cache_params
  2017-01-14  5:54 ` [PATCH 4/9] slab: reorganize memcg_cache_params Tejun Heo
@ 2017-01-14 13:30   ` Vladimir Davydov
  0 siblings, 0 replies; 26+ messages in thread
From: Vladimir Davydov @ 2017-01-14 13:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 12:54:44AM -0500, Tejun Heo wrote:
> We're gonna change how memcg caches are iterated.  In preparation,
> clean up and reorganize memcg_cache_params.
> 
> * The shared ->list is replaced by ->children in root and
>   ->children_node in children.
> 
> * ->is_root_cache is removed.  Instead ->root_cache is moved out of
>   the child union and now used by both root and children.  NULL
>   indicates root cache.  Non-NULL a memcg one.
> 
> This patch doesn't cause any observable behavior changes.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 5/9] slab: link memcg kmem_caches on their associated memory cgroup
  2017-01-14  5:54 ` [PATCH 5/9] slab: link memcg kmem_caches on their associated memory cgroup Tejun Heo
@ 2017-01-14 13:33   ` Vladimir Davydov
  0 siblings, 0 replies; 26+ messages in thread
From: Vladimir Davydov @ 2017-01-14 13:33 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 12:54:45AM -0500, Tejun Heo wrote:
> With kmem cgroup support enabled, kmem_caches can be created and
> destroyed frequently and a great number of near empty kmem_caches can
> accumulate if there are a lot of transient cgroups and the system is
> not under memory pressure.  When memory reclaim starts under such
> conditions, it can lead to consecutive deactivation and destruction of
> many kmem_caches, easily hundreds of thousands on moderately large
> systems, exposing scalability issues in the current slab management
> code.  This is one of the patches to address the issue.
> 
> While a memcg kmem_cache is listed on its root cache's ->children
> list, there is no direct way to iterate all kmem_caches which are
> associated with a memory cgroup.  The only way to iterate them is
> walking all caches while filtering out caches which don't match, which
> would be most of them.
> 
> This makes memcg destruction operations O(N^2) where N is the total
> number of slab caches which can be huge.  This combined with the
> synchronous RCU operations can tie up a CPU and affect the whole
> machine for many hours when memory reclaim triggers offlining and
> destruction of the stale memcgs.
> 
> This patch adds mem_cgroup->kmem_caches list which goes through
> memcg_cache_params->kmem_caches_node of all kmem_caches which are
> associated with the memcg.  All memcg specific iterations, including
> stat file access, are updated to use the new list instead.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Jay Vana <jsvana@fb.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/9] slab: don't put memcg caches on slab_caches list
  2017-01-14  5:54 ` [PATCH 6/9] slab: don't put memcg caches on slab_caches list Tejun Heo
@ 2017-01-14 13:39   ` Vladimir Davydov
  2017-01-14 15:39     ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Vladimir Davydov @ 2017-01-14 13:39 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 12:54:46AM -0500, Tejun Heo wrote:
> With kmem cgroup support enabled, kmem_caches can be created and
> destroyed frequently and a great number of near empty kmem_caches can
> accumulate if there are a lot of transient cgroups and the system is
> not under memory pressure.  When memory reclaim starts under such
> conditions, it can lead to consecutive deactivation and destruction of
> many kmem_caches, easily hundreds of thousands on moderately large
> systems, exposing scalability issues in the current slab management
> code.  This is one of the patches to address the issue.
> 
> slab_caches currently lists all caches including root and memcg ones.
> This is the only data structure which lists the root caches and
> iterating root caches can only be done by walking the list while
> skipping over memcg caches.  As there can be a huge number of memcg
> caches, this can become very expensive.
> 
> This also can make /proc/slabinfo behave very badly.  seq_file
> processes reads in 4k chunks and seeks to the previous Nth position on
> slab_caches list to resume after each chunk.  With a lot of memcg
> cache churns on the list, reading /proc/slabinfo can become very slow
> and its content often ends up with duplicate and/or missing entries.
> 
> As the previous patch made it unnecessary to walk slab_caches to
> iterate memcg-specific caches, there is no reason to keep memcg caches
> on the list.  This patch makes slab_caches include only the root
> caches.  As this makes slab_cache->list unused for memcg caches,
> ->memcg_params.children_node is removed and ->list is used instead.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Jay Vana <jsvana@fb.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/slab.h |  3 ---
>  mm/slab.h            |  3 +--
>  mm/slab_common.c     | 58 +++++++++++++++++++++++++---------------------------
>  3 files changed, 29 insertions(+), 35 deletions(-)

IIRC the slab_caches list is also used on cpu/mem online/offline, so you
have to patch those places to ensure that memcg caches get updated too.
Other than that the patch looks good to me.
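
A rough sketch of the kind of update needed, assuming the existing
for_each_memcg_cache() helper is adapted to walk the new per-root
->children list (illustration only):

static int slab_mem_going_offline_callback(void *arg)
{
        struct kmem_cache *s, *c;

        mutex_lock(&slab_mutex);
        list_for_each_entry(s, &slab_caches, list) {
                __kmem_cache_shrink(s);
                /*
                 * slab_caches now carries only root caches, so the
                 * memcg children have to be shrunk explicitly.
                 */
                for_each_memcg_cache(c, s)
                        __kmem_cache_shrink(c);
        }
        mutex_unlock(&slab_mutex);

        return 0;
}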

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/9] slab: introduce __kmemcg_cache_deactivate()
  2017-01-14  5:54 ` [PATCH 7/9] slab: introduce __kmemcg_cache_deactivate() Tejun Heo
@ 2017-01-14 13:42   ` Vladimir Davydov
  2017-01-14 15:39     ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Vladimir Davydov @ 2017-01-14 13:42 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 12:54:47AM -0500, Tejun Heo wrote:
> __kmem_cache_shrink() is called with %true @deactivate only for memcg
> caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
> __kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
> should implement it and it should deactivate and drain the cache.
> 
> This is to allow memcg cache deactivation behavior to further deviate
> from simple shrinking without messing up __kmem_cache_shrink().
> 
> This is pure reorganization and doesn't introduce any observable
> behavior changes.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

...
> diff --git a/mm/slab.h b/mm/slab.h
> index 8f47a44..73ed6b5 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -164,7 +164,10 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
>  
>  int __kmem_cache_shutdown(struct kmem_cache *);
>  void __kmem_cache_release(struct kmem_cache *);
> -int __kmem_cache_shrink(struct kmem_cache *, bool);
> +int __kmem_cache_shrink(struct kmem_cache *);
> +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> +void __kmemcg_cache_deactivate(struct kmem_cache *s);
> +#endif

nit: ifdef is not necessary

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path
  2017-01-14  5:54 ` [PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path Tejun Heo
@ 2017-01-14 13:57   ` Vladimir Davydov
  0 siblings, 0 replies; 26+ messages in thread
From: Vladimir Davydov @ 2017-01-14 13:57 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 12:54:48AM -0500, Tejun Heo wrote:
> With kmem cgroup support enabled, kmem_caches can be created and
> destroyed frequently and a great number of near empty kmem_caches can
> accumulate if there are a lot of transient cgroups and the system is
> not under memory pressure.  When memory reclaim starts under such
> conditions, it can lead to consecutive deactivation and destruction of
> many kmem_caches, easily hundreds of thousands on moderately large
> systems, exposing scalability issues in the current slab management
> code.  This is one of the patches to address the issue.
> 
> slub uses synchronize_sched() to deactivate a memcg cache.
> synchronize_sched() is an expensive and slow operation and doesn't
> scale when a huge number of caches are destroyed back-to-back.  While
> there used to be a simple batching mechanism, the batching was too
> restricted to be helpful.
> 
> This patch implements slab_deactivate_memcg_cache_rcu_sched() which
> slub can use to schedule a sched RCU callback instead of performing
> synchronize_sched() synchronously while holding cgroup_mutex.  While
> this adds online cpus, mems and slab_mutex operations, operating on
> these locks back-to-back from the same kworker, which is what's gonna
> happen when there are many to deactivate, isn't expensive at all and
> this gets rid of the scalability problem completely.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Jay Vana <jsvana@fb.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>

I don't think there's much point in having the infrastructure for this
in slab_common.c, as only SLUB needs it, but it isn't a show stopper.

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 9/9] slab: remove slub sysfs interface files early for empty memcg caches
  2017-01-14  5:54 ` [PATCH 9/9] slab: remove slub sysfs interface files early for empty memcg caches Tejun Heo
@ 2017-01-14 14:00   ` Vladimir Davydov
  0 siblings, 0 replies; 26+ messages in thread
From: Vladimir Davydov @ 2017-01-14 14:00 UTC (permalink / raw)
  To: Tejun Heo
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 12:54:49AM -0500, Tejun Heo wrote:
> With kmem cgroup support enabled, kmem_caches can be created and
> destroyed frequently and a great number of near empty kmem_caches can
> accumulate if there are a lot of transient cgroups and the system is
> not under memory pressure.  When memory reclaim starts under such
> conditions, it can lead to consecutive deactivation and destruction of
> many kmem_caches, easily hundreds of thousands on moderately large
> systems, exposing scalability issues in the current slab management
> code.  This is one of the patches to address the issue.
> 
> Each cache has a number of sysfs interface files under
> /sys/kernel/slab.  On a system with a lot of memory and transient
> memcgs, the number of interface files which have to be removed once
> memory reclaim kicks in can reach millions.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Jay Vana <jsvana@fb.com>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
  2017-01-14 13:19   ` Vladimir Davydov
@ 2017-01-14 15:19     ` Tejun Heo
  2017-01-17  0:07       ` Joonsoo Kim
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-14 15:19 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

Hello, Vladimir.

On Sat, Jan 14, 2017 at 04:19:39PM +0300, Vladimir Davydov wrote:
> On Sat, Jan 14, 2017 at 12:54:42AM -0500, Tejun Heo wrote:
> > This patch updates the cache release path so that it simply uses
> > call_rcu() instead of the synchronous rcu_barrier() + custom batching.
> > This doesn't cost more while being logically simpler and way more
> > scalable.
> 
> The point of rcu_barrier() is to wait until all rcu calls freeing slabs
> from the cache being destroyed are over (rcu_free_slab, kmem_rcu_free).
> I'm not sure if call_rcu() guarantees that for all rcu implementations
> too. If it did, why would we need rcu_barrier() at all?

Yeah, I had a similar question and scanned its users briefly.  Looks
like it's used in combination with ctors so that its users can
opportunistically dereference objects and e.g. check ids / state /
whatever without worrying about the objects' lifetimes.
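
A minimal sketch of that usage pattern; the structure and lookup
function are made up for illustration, only the RCU and atomic
primitives are real:

/* object cached with SLAB_DESTROY_BY_RCU; the ctor initializes
 * ->lock and ->refcnt exactly once per memory slot */
struct example_obj {
        spinlock_t              lock;
        atomic_t                refcnt;
        unsigned int            id;
        struct hlist_node       node;
};

static struct example_obj *example_lookup(struct hlist_head *head,
                                          unsigned int id)
{
        struct example_obj *obj;

        rcu_read_lock();
        hlist_for_each_entry_rcu(obj, head, node) {
                if (obj->id != id)
                        continue;
                /*
                 * The memory can't go back to the page allocator under
                 * us, but the object may have been freed and reused,
                 * so take a reference and revalidate.
                 */
                if (!atomic_inc_not_zero(&obj->refcnt))
                        continue;
                if (obj->id != id) {
                        atomic_dec(&obj->refcnt);       /* simplified put */
                        continue;
                }
                rcu_read_unlock();
                return obj;
        }
        rcu_read_unlock();
        return NULL;
}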

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/9] slab: simplify shutdown_memcg_caches()
  2017-01-14 13:27   ` Vladimir Davydov
@ 2017-01-14 15:38     ` Tejun Heo
  2017-01-14 15:53       ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-14 15:38 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 04:27:22PM +0300, Vladimir Davydov wrote:
> > -	 * Second, shutdown all caches left from memory cgroups that are now
> > -	 * offline.
> > +	 * Shutdown all caches.
> >  	 */
> >  	list_for_each_entry_safe(c, c2, &s->memcg_params.list,
> >  				 memcg_params.list)
> >  		shutdown_cache(c);
> 
> The point of this complexity was to leave on the list caches that
> happen to have objects when kmem_cache_destroy() is called, so that
> they could be reused later. This behavior was inherited from the global

Ah, right, I misread the branch.  I don't quite get how the cache can
be reused later tho?  This is called when the memcg gets released and
a clear error condition - the caller, kmem_cache_destroy(), handles it
as an error condition too.

> caches - if kmem_cache_destroy() is called on a cache that still has
> objects, we print a warning message and don't destroy the cache. This
> patch changes this behavior.

Hmm... yeah, we're missing the error return propagation.  I think
that's the only meaningful difference tho, right?  Will update the
patch.

Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/9] slab: don't put memcg caches on slab_caches list
  2017-01-14 13:39   ` Vladimir Davydov
@ 2017-01-14 15:39     ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-01-14 15:39 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 04:39:18PM +0300, Vladimir Davydov wrote:
> IIRC the slab_caches list is also used on cpu/mem online/offline, so you
> have to patch those places to ensure that memcg caches get updated too.
> Other than that the patch looks good to me.

Right, will update.  Thanks!

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/9] slab: introduce __kmemcg_cache_deactivate()
  2017-01-14 13:42   ` Vladimir Davydov
@ 2017-01-14 15:39     ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-01-14 15:39 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 04:42:11PM +0300, Vladimir Davydov wrote:
> > +#if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
> > +void __kmemcg_cache_deactivate(struct kmem_cache *s);
> > +#endif
> 
> nit: ifdef is not necessary

Will drop, thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/9] slab: simplify shutdown_memcg_caches()
  2017-01-14 15:38     ` Tejun Heo
@ 2017-01-14 15:53       ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-01-14 15:53 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: cl, penberg, rientjes, iamjoonsoo.kim, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 10:38:01AM -0500, Tejun Heo wrote:
> On Sat, Jan 14, 2017 at 04:27:22PM +0300, Vladimir Davydov wrote:
> > > -	 * Second, shutdown all caches left from memory cgroups that are now
> > > -	 * offline.
> > > +	 * Shutdown all caches.
> > >  	 */
> > >  	list_for_each_entry_safe(c, c2, &s->memcg_params.list,
> > >  				 memcg_params.list)
> > >  		shutdown_cache(c);
> > 
> > The point of this complexity was to leave on the list caches that
> > happen to have objects when kmem_cache_destroy() is called, so that
> > they could be reused later. This behavior was inherited from the global
> 
> Ah, right, I misread the branch.  I don't quite get how the cache can
> be reused later tho?  This is called when the memcg gets released and
> a clear error condition - the caller, kmem_cache_destroy(), handles it
> as an error condition too.

I think I understand it now.  This is the alias being able to find and
reuse the cache.  Heh, that's a weird optimization for a corner error
case.  Anyways, I'll drop this patch.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
  2017-01-14 15:19     ` Tejun Heo
@ 2017-01-17  0:07       ` Joonsoo Kim
  2017-01-17 16:37         ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Joonsoo Kim @ 2017-01-17  0:07 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Vladimir Davydov, cl, penberg, rientjes, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 10:19:21AM -0500, Tejun Heo wrote:
> Hello, Vladimir.
> 
> On Sat, Jan 14, 2017 at 04:19:39PM +0300, Vladimir Davydov wrote:
> > On Sat, Jan 14, 2017 at 12:54:42AM -0500, Tejun Heo wrote:
> > > This patch updates the cache release path so that it simply uses
> > > call_rcu() instead of the synchronous rcu_barrier() + custom batching.
> > > This doesn't cost more while being logically simpler and way more
> > > scalable.
> > 
> > The point of rcu_barrier() is to wait until all rcu calls freeing slabs
> > from the cache being destroyed are over (rcu_free_slab, kmem_rcu_free).
> > I'm not sure if call_rcu() guarantees that for all rcu implementations
> > too. If it did, why would we need rcu_barrier() at all?
> 
> Yeah, I had a similar question and scanned its users briefly.  Looks
> like it's used in combination with ctors so that its users can
> opportunistically dereference objects and e.g. check ids / state /
> whatever without worrying about the objects' lifetimes.

Hello, Tejun.

Long time no see! :)

IIUC, rcu_barrier() here prevents the kmem_cache from being destroyed
until all slab pages in it are freed. These slab pages are freed
through call_rcu().

Your patch changes it to another call_rcu() and, I think, if the
sequence of executing rcu callbacks is the same as the sequence of
adding rcu callbacks, it would work. However, I'm not sure that this is
guaranteed by the RCU API. Am I missing something?

Thanks.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
  2017-01-17  0:07       ` Joonsoo Kim
@ 2017-01-17 16:37         ` Tejun Heo
  2017-01-17 17:02           ` Tejun Heo
  0 siblings, 1 reply; 26+ messages in thread
From: Tejun Heo @ 2017-01-17 16:37 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vladimir Davydov, cl, penberg, rientjes, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

Hello, Joonsoo.

On Tue, Jan 17, 2017 at 09:07:54AM +0900, Joonsoo Kim wrote:
> Long time no see! :)

Yeah, happy new year!

> IIUC, rcu_barrier() here prevents the kmem_cache from being destroyed
> until all slab pages in it are freed. These slab pages are freed
> through call_rcu().

Hmm... why do we need that tho?  SLAB_DESTROY_BY_RCU only needs to
protect the slab pages, not kmem cache struct.  I thought that this
was because kmem cache destruction is allowed to release pages w/o RCU
delaying it.

> Your patch changes it to another call_rcu() and, I think, if the
> sequence of executing rcu callbacks is the same as the sequence of
> adding rcu callbacks, it would work. However, I'm not sure that this
> is guaranteed by the RCU API. Am I missing something?

The call sequence doesn't matter.  Whether you're using call_rcu() or
rcu_barrier(), you're just waiting for a grace period to pass before
continuing.  It doesn't give any other ordering guarantees, so the new
code should be equivalent to the old one except for being asynchronous.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path
  2017-01-17 16:37         ` Tejun Heo
@ 2017-01-17 17:02           ` Tejun Heo
  0 siblings, 0 replies; 26+ messages in thread
From: Tejun Heo @ 2017-01-17 17:02 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Vladimir Davydov, cl, penberg, rientjes, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Tue, Jan 17, 2017 at 08:37:45AM -0800, Tejun Heo wrote:
> The call sequence doesn't matter.  Whether you're using call_rcu() or
> rcu_barrier(), you're just waiting for a grace period to pass before
> continuing.  It doesn't give any other ordering guarantees, so the new
> code should be equivalent to the old one except for being asynchronous.

Oh I was confusing synchronize_rcu() with rcu_barrier(), so you're
right, kmem_cache struct needs to stay around for the slab pages to be
freed after RCU grace period.  Will revise the patch accordingly,
thanks.
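
A sketch of the resulting ordering requirement (slab_kmem_cache_release()
is the existing release helper; the function below is illustrative, not
the revised patch):

static void release_cache_sketch(struct kmem_cache *s)
{
        /*
         * Waits for all RCU callbacks queued so far on any CPU,
         * including the ones still freeing @s's remaining slab pages.
         */
        rcu_barrier();

        slab_kmem_cache_release(s);
}

A bare call_rcu() in place of the rcu_barrier() would only defer the
release of @s by one grace period and wouldn't wait for callbacks still
pending on other CPUs, which is why the kmem_cache struct has to stick
around until they have run.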

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-01-17 19:07 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-14  5:54 [PATCHSET] slab: make memcg slab destruction scalable Tejun Heo
2017-01-14  5:54 ` [PATCH 1/9] Revert "slub: move synchronize_sched out of slab_mutex on shrink" Tejun Heo
2017-01-14  5:54 ` [PATCH 2/9] slab: remove synchronous rcu_barrier() call in memcg cache release path Tejun Heo
2017-01-14 13:19   ` Vladimir Davydov
2017-01-14 15:19     ` Tejun Heo
2017-01-17  0:07       ` Joonsoo Kim
2017-01-17 16:37         ` Tejun Heo
2017-01-17 17:02           ` Tejun Heo
2017-01-14  5:54 ` [PATCH 3/9] slab: simplify shutdown_memcg_caches() Tejun Heo
2017-01-14 13:27   ` Vladimir Davydov
2017-01-14 15:38     ` Tejun Heo
2017-01-14 15:53       ` Tejun Heo
2017-01-14  5:54 ` [PATCH 4/9] slab: reorganize memcg_cache_params Tejun Heo
2017-01-14 13:30   ` Vladimir Davydov
2017-01-14  5:54 ` [PATCH 5/9] slab: link memcg kmem_caches on their associated memory cgroup Tejun Heo
2017-01-14 13:33   ` Vladimir Davydov
2017-01-14  5:54 ` [PATCH 6/9] slab: don't put memcg caches on slab_caches list Tejun Heo
2017-01-14 13:39   ` Vladimir Davydov
2017-01-14 15:39     ` Tejun Heo
2017-01-14  5:54 ` [PATCH 7/9] slab: introduce __kmemcg_cache_deactivate() Tejun Heo
2017-01-14 13:42   ` Vladimir Davydov
2017-01-14 15:39     ` Tejun Heo
2017-01-14  5:54 ` [PATCH 8/9] slab: remove synchronous synchronize_sched() from memcg cache deactivation path Tejun Heo
2017-01-14 13:57   ` Vladimir Davydov
2017-01-14  5:54 ` [PATCH 9/9] slab: remove slub sysfs interface files early for empty memcg caches Tejun Heo
2017-01-14 14:00   ` Vladimir Davydov
