* [PATCHSET v2] slab: make memcg slab destruction scalable
@ 2017-01-14 18:48 Tejun Heo
  2017-01-14 18:48 ` [PATCH 1/8] Revert "slub: move synchronize_sched out of slab_mutex on shrink" Tejun Heo
                   ` (8 more replies)
  0 siblings, 9 replies; 15+ messages in thread
From: Tejun Heo @ 2017-01-14 18:48 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team

This is v2.  Changes from the last version[L] are

* 0002-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch was
  incorrect and dropped.

* 0006-slab-don-t-put-memcg-caches-on-slab_caches-list.patch
  incorrectly converted places which need to walk all caches.
  Replaced it with 0005-slab-implement-slab_root_caches-list.patch,
  which adds a separate root-only list instead of making the
  slab_caches list carry only root caches.

* Misc fixes.

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.

I've seen machines which end up with hundreds of thousands of caches and
many millions of kernfs_nodes.  The current code is O(N^2) on the
total number of caches and has synchronous rcu_barrier() and
synchronize_sched() in cgroup offline / release path which is executed
while holding cgroup_mutex.  Combined, this leads to very expensive
and slow cache destruction operations which can easily keep running
for half a day.
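
To illustrate, the destruction path currently has to walk the entire
global slab_caches list once for every dying memcg and filter by
ownership.  A rough sketch of the existing memcg_destroy_kmem_caches()
logic (0004 replaces the real version of this loop):

        /* one full walk over every cache on the system, per dying memcg */
        mutex_lock(&slab_mutex);
        list_for_each_entry_safe(s, s2, &slab_caches, list) {
                if (is_root_cache(s) || s->memcg_params.memcg != memcg)
                        continue;       /* almost every entry is skipped */
                shutdown_cache(s);      /* tear down this memcg's cache */
        }
        mutex_unlock(&slab_mutex);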

This also messes up /proc/slabinfo along with other cache iterating
operations.  seq_file operates in 4k chunks and on each 4k boundary
tries to seek back to the last position in the list.  With a huge
number of caches on the list, this becomes very slow and very prone to
the list content changing underneath it, leading to a lot of missing
and/or duplicate entries.
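
For example, the slabinfo seq_file ops resume purely by list position,
so each 4k chunk re-walks slab_caches from its head.  This is roughly
what today's mm/slab_common.c does:

        void *slab_start(struct seq_file *m, loff_t *pos)
        {
                mutex_lock(&slab_mutex);
                /* linear walk from the list head to the saved position */
                return seq_list_start(&slab_caches, *pos);
        }

        void *slab_next(struct seq_file *m, void *p, loff_t *pos)
        {
                return seq_list_next(p, &slab_caches, pos);
        }

If caches are created or destroyed between two chunks, the saved
position now refers to a different cache, which is where the missing
and duplicate entries come from.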

This patchset addresses the scalability problem.

* Add root and per-memcg lists.  Update each user to use the
  appropriate list.

* Replace rcu_barrier() and synchronize_sched() with call_rcu() and
  call_rcu_sched() (see the sketch after this list).

* For dying empty slub caches, remove the sysfs files after
  deactivation so that we don't end up with millions of sysfs files
  without any useful information on them.
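
The heart of the rcu_barrier() removal is that, once a cache has been
unlinked from all lists, its ->list storage is free to double as an
rcu_head, so the grace period can be waited for asynchronously.  A
rough sketch of the resulting shutdown path (0002 has the real thing):

        union {
                struct list_head list;          /* while the cache is linked */
                struct rcu_head rcu_head;       /* reused once it's unlinked */
        };

        /* shutdown_cache(), simplified */
        list_del(&s->list);
        if (s->flags & SLAB_DESTROY_BY_RCU)
                call_rcu(&s->rcu_head, slab_kmem_cache_release_rcufn);
        else
                slab_kmem_cache_release_rcufn(&s->rcu_head);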

This patchset contains the following eight patches.

 0001-Revert-slub-move-synchronize_sched-out-of-slab_mutex.patch
 0002-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch
 0003-slab-reorganize-memcg_cache_params.patch
 0004-slab-link-memcg-kmem_caches-on-their-associated-memo.patch
 0005-slab-implement-slab_root_caches-list.patch
 0006-slab-introduce-__kmemcg_cache_deactivate.patch
 0007-slab-remove-synchronous-synchronize_sched-from-memcg.patch
 0008-slab-remove-slub-sysfs-interface-files-early-for-emp.patch

0001 reverts an existing optimization to prepare for the following
changes.  0002 replaces rcu_barrier() in the release path with
call_rcu().  0003 reorganizes memcg_cache_params in preparation.
0004-0005 separate out the per-memcg and root-only lists.  0006-0007
replace synchronize_sched() in the slub deactivation path with
call_rcu_sched().  0008 removes sysfs files early for empty dying
caches.

This patchset is on top of the current linus#master a121103c9228 and
also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-kmemcg-scalability

diffstat follows.  Thanks.

 include/linux/memcontrol.h |    1 
 include/linux/slab.h       |   45 ++++++-
 include/linux/slab_def.h   |    5 
 include/linux/slub_def.h   |    9 +
 mm/memcontrol.c            |    7 -
 mm/slab.c                  |    7 +
 mm/slab.h                  |   32 ++++-
 mm/slab_common.c           |  269 +++++++++++++++++++++++++++------------------
 mm/slub.c                  |   55 +++++++++
 9 files changed, 302 insertions(+), 128 deletions(-)

--
tejun

[L] http://lkml.kernel.org/r/20170114055449.11044-1-tj@kernel.org


* [PATCH 1/8] Revert "slub: move synchronize_sched out of slab_mutex on shrink"
  2017-01-14 18:48 [PATCHSET v2] slab: make memcg slab destruction scalable Tejun Heo
@ 2017-01-14 18:48 ` Tejun Heo
  2017-01-14 18:48 ` [PATCH 2/8] slab: remove synchronous rcu_barrier() call in memcg cache release path Tejun Heo
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-01-14 18:48 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

This reverts commit 89e364db71fb5e7fc8d93228152abfa67daf35fa.

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

Moving synchronize_sched() out of slab_mutex isn't enough as it's
still inside cgroup_mutex.  The whole deactivation / release path will
be updated to avoid all synchronous RCU operations.  Revert this
insufficient optimization in preparation for the coming changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/slab.c        |  4 ++--
 mm/slab.h        |  2 +-
 mm/slab_common.c | 27 ++-------------------------
 mm/slob.c        |  2 +-
 mm/slub.c        | 19 +++++++++++++++++--
 5 files changed, 23 insertions(+), 31 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 29bc6c0..767e8e4 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2314,7 +2314,7 @@ static int drain_freelist(struct kmem_cache *cache,
 	return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep)
+int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 {
 	int ret = 0;
 	int node;
@@ -2334,7 +2334,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep)
 
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-	return __kmem_cache_shrink(cachep);
+	return __kmem_cache_shrink(cachep, false);
 }
 
 void __kmem_cache_release(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index de6579d..4acc644 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -161,7 +161,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *);
+int __kmem_cache_shrink(struct kmem_cache *, bool);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ae32384..46ff746 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -579,29 +579,6 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 	get_online_cpus();
 	get_online_mems();
 
-#ifdef CONFIG_SLUB
-	/*
-	 * In case of SLUB, we need to disable empty slab caching to
-	 * avoid pinning the offline memory cgroup by freeable kmem
-	 * pages charged to it. SLAB doesn't need this, as it
-	 * periodically purges unused slabs.
-	 */
-	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list) {
-		c = is_root_cache(s) ? cache_from_memcg_idx(s, idx) : NULL;
-		if (c) {
-			c->cpu_partial = 0;
-			c->min_partial = 0;
-		}
-	}
-	mutex_unlock(&slab_mutex);
-	/*
-	 * kmem_cache->cpu_partial is checked locklessly (see
-	 * put_cpu_partial()). Make sure the change is visible.
-	 */
-	synchronize_sched();
-#endif
-
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
 		if (!is_root_cache(s))
@@ -613,7 +590,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 		if (!c)
 			continue;
 
-		__kmem_cache_shrink(c);
+		__kmem_cache_shrink(c, true);
 		arr->entries[idx] = NULL;
 	}
 	mutex_unlock(&slab_mutex);
@@ -784,7 +761,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 	get_online_cpus();
 	get_online_mems();
 	kasan_cache_shrink(cachep);
-	ret = __kmem_cache_shrink(cachep);
+	ret = __kmem_cache_shrink(cachep, false);
 	put_online_mems();
 	put_online_cpus();
 	return ret;
diff --git a/mm/slob.c b/mm/slob.c
index eac04d4..5ec1580 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -634,7 +634,7 @@ void __kmem_cache_release(struct kmem_cache *c)
 {
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d)
+int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
 {
 	return 0;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 067598a..68b84f9 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3883,7 +3883,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 {
 	int node;
 	int i;
@@ -3895,6 +3895,21 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 	unsigned long flags;
 	int ret = 0;
 
+	if (deactivate) {
+		/*
+		 * Disable empty slabs caching. Used to avoid pinning offline
+		 * memory cgroups by kmem pages that can be freed.
+		 */
+		s->cpu_partial = 0;
+		s->min_partial = 0;
+
+		/*
+		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
+		 * so we have to make sure the change is visible.
+		 */
+		synchronize_sched();
+	}
+
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
 		INIT_LIST_HEAD(&discard);
@@ -3951,7 +3966,7 @@ static int slab_mem_going_offline_callback(void *arg)
 
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list)
-		__kmem_cache_shrink(s);
+		__kmem_cache_shrink(s, false);
 	mutex_unlock(&slab_mutex);
 
 	return 0;
-- 
2.9.3


* [PATCH 2/8] slab: remove synchronous rcu_barrier() call in memcg cache release path
  2017-01-14 18:48 [PATCHSET v2] slab: make memcg slab destruction scalable Tejun Heo
  2017-01-14 18:48 ` [PATCH 1/8] Revert "slub: move synchronize_sched out of slab_mutex on shrink" Tejun Heo
@ 2017-01-14 18:48 ` Tejun Heo
  2017-01-14 18:48 ` [PATCH 3/8] slab: reorganize memcg_cache_params Tejun Heo
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-01-14 18:48 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

SLAB_DESTROY_BY_RCU caches need an RCU grace period before destruction.
Currently, it's done synchronously with rcu_barrier().  As
rcu_barrier() is expensive time-wise, slab implements a batching
mechanism so that rcu_barrier() can be done for multiple caches at the
same time.

Unfortunately, the rcu_barrier() is in a synchronous path which is
called while holding cgroup_mutex, and the batching is too limited to
be actually helpful.  Besides, the batching is just a very degenerate
form of the actual RCU callback mechanism.

This patch updates the cache release path so that it simply uses
call_rcu() instead of the synchronous rcu_barrier() + custom batching.
This doesn't cost more while being logically simpler and way more
scalable.

* ->rcu_head is added to kmem_cache structs.  It shares storage space
  with ->list.

* slub sysfs removal and release are separated and the former is now
  called from __kmem_cache_shutdown() while the latter is called from
  the release path.  There's no reason to defer sysfs removal through
  RCU and this makes it unnecessary to bounce to workqueue from the
  RCU callback.

* release_caches() is removed and shutdown_cache() now either directly
  releases the cache or schedules an RCU callback to do that.  This
  makes the cache inaccessible once shutdown_cache() is called and
  makes it impossible for shutdown_memcg_caches() to do memcg-specific
  cleanups afterwards.  Move memcg-specific part into a helper,
  unlink_memcg_cache(), and make shutdown_cache() call it directly.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/slab_def.h |  5 ++-
 include/linux/slub_def.h |  9 ++++--
 mm/slab.h                |  5 ++-
 mm/slab_common.c         | 84 ++++++++++++++++++++----------------------------
 mm/slub.c                |  9 +++++-
 5 files changed, 57 insertions(+), 55 deletions(-)

diff --git a/include/linux/slab_def.h b/include/linux/slab_def.h
index 4ad2c5a..b649629 100644
--- a/include/linux/slab_def.h
+++ b/include/linux/slab_def.h
@@ -39,7 +39,10 @@ struct kmem_cache {
 
 /* 4) cache creation/removal */
 	const char *name;
-	struct list_head list;
+	union {
+		struct list_head list;
+		struct rcu_head rcu_head;
+	};
 	int refcount;
 	int object_size;
 	int align;
diff --git a/include/linux/slub_def.h b/include/linux/slub_def.h
index 75f56c2..7637b41 100644
--- a/include/linux/slub_def.h
+++ b/include/linux/slub_def.h
@@ -80,7 +80,10 @@ struct kmem_cache {
 	int align;		/* Alignment */
 	int reserved;		/* Reserved bytes at the end of slabs */
 	const char *name;	/* Name (only for display!) */
-	struct list_head list;	/* List of slab caches */
+	union {
+		struct list_head list;	/* List of slab caches */
+		struct rcu_head rcu_head;
+	};
 	int red_left_pad;	/* Left redzone padding size */
 #ifdef CONFIG_SYSFS
 	struct kobject kobj;	/* For sysfs */
@@ -113,9 +116,9 @@ struct kmem_cache {
 
 #ifdef CONFIG_SYSFS
 #define SLAB_SUPPORTS_SYSFS
-void sysfs_slab_remove(struct kmem_cache *);
+void sysfs_slab_release(struct kmem_cache *);
 #else
-static inline void sysfs_slab_remove(struct kmem_cache *s)
+static inline void sysfs_slab_release(struct kmem_cache *s)
 {
 }
 #endif
diff --git a/mm/slab.h b/mm/slab.h
index 4acc644..3fa2d77 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -24,7 +24,10 @@ struct kmem_cache {
 	const char *name;	/* Slab name for sysfs */
 	int refcount;		/* Use counter */
 	void (*ctor)(void *);	/* Called on object slot creation */
-	struct list_head list;	/* List of all slab caches on the system */
+	union {
+		struct list_head list;	/* List of all slab caches on the system */
+		struct rcu_head rcu_head;
+	};
 };
 
 #endif /* CONFIG_SLOB */
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 46ff746..851c75e 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -215,6 +215,11 @@ int memcg_update_all_caches(int num_memcgs)
 	mutex_unlock(&slab_mutex);
 	return ret;
 }
+
+static void unlink_memcg_cache(struct kmem_cache *s)
+{
+	list_del(&s->memcg_params.list);
+}
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
 		struct mem_cgroup *memcg, struct kmem_cache *root_cache)
@@ -225,6 +230,10 @@ static inline int init_memcg_params(struct kmem_cache *s,
 static inline void destroy_memcg_params(struct kmem_cache *s)
 {
 }
+
+static inline void unlink_memcg_cache(struct kmem_cache *s)
+{
+}
 #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
 
 /*
@@ -458,33 +467,32 @@ kmem_cache_create(const char *name, size_t size, size_t align,
 }
 EXPORT_SYMBOL(kmem_cache_create);
 
-static int shutdown_cache(struct kmem_cache *s,
-		struct list_head *release, bool *need_rcu_barrier)
+static void slab_kmem_cache_release_rcufn(struct rcu_head *head)
 {
-	if (__kmem_cache_shutdown(s) != 0)
-		return -EBUSY;
+	struct kmem_cache *s = container_of(head, struct kmem_cache, rcu_head);
 
-	if (s->flags & SLAB_DESTROY_BY_RCU)
-		*need_rcu_barrier = true;
-
-	list_move(&s->list, release);
-	return 0;
+#ifdef SLAB_SUPPORTS_SYSFS
+	sysfs_slab_release(s);
+#else
+	slab_kmem_cache_release(s);
+#endif
 }
 
-static void release_caches(struct list_head *release, bool need_rcu_barrier)
+static int shutdown_cache(struct kmem_cache *s)
 {
-	struct kmem_cache *s, *s2;
+	if (__kmem_cache_shutdown(s) != 0)
+		return -EBUSY;
 
-	if (need_rcu_barrier)
-		rcu_barrier();
+	list_del(&s->list);
+	if (!is_root_cache(s))
+		unlink_memcg_cache(s);
 
-	list_for_each_entry_safe(s, s2, release, list) {
-#ifdef SLAB_SUPPORTS_SYSFS
-		sysfs_slab_remove(s);
-#else
-		slab_kmem_cache_release(s);
-#endif
-	}
+	if (s->flags & SLAB_DESTROY_BY_RCU)
+		call_rcu(&s->rcu_head, slab_kmem_cache_release_rcufn);
+	else
+		slab_kmem_cache_release_rcufn(&s->rcu_head);
+
+	return 0;
 }
 
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
@@ -599,22 +607,8 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 	put_online_cpus();
 }
 
-static int __shutdown_memcg_cache(struct kmem_cache *s,
-		struct list_head *release, bool *need_rcu_barrier)
-{
-	BUG_ON(is_root_cache(s));
-
-	if (shutdown_cache(s, release, need_rcu_barrier))
-		return -EBUSY;
-
-	list_del(&s->memcg_params.list);
-	return 0;
-}
-
 void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
 {
-	LIST_HEAD(release);
-	bool need_rcu_barrier = false;
 	struct kmem_cache *s, *s2;
 
 	get_online_cpus();
@@ -628,18 +622,15 @@ void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
 		 * The cgroup is about to be freed and therefore has no charges
 		 * left. Hence, all its caches must be empty by now.
 		 */
-		BUG_ON(__shutdown_memcg_cache(s, &release, &need_rcu_barrier));
+		BUG_ON(shutdown_cache(s));
 	}
 	mutex_unlock(&slab_mutex);
 
 	put_online_mems();
 	put_online_cpus();
-
-	release_caches(&release, need_rcu_barrier);
 }
 
-static int shutdown_memcg_caches(struct kmem_cache *s,
-		struct list_head *release, bool *need_rcu_barrier)
+static int shutdown_memcg_caches(struct kmem_cache *s)
 {
 	struct memcg_cache_array *arr;
 	struct kmem_cache *c, *c2;
@@ -658,7 +649,7 @@ static int shutdown_memcg_caches(struct kmem_cache *s,
 		c = arr->entries[i];
 		if (!c)
 			continue;
-		if (__shutdown_memcg_cache(c, release, need_rcu_barrier))
+		if (shutdown_cache(c))
 			/*
 			 * The cache still has objects. Move it to a temporary
 			 * list so as not to try to destroy it for a second
@@ -681,7 +672,7 @@ static int shutdown_memcg_caches(struct kmem_cache *s,
 	 */
 	list_for_each_entry_safe(c, c2, &s->memcg_params.list,
 				 memcg_params.list)
-		__shutdown_memcg_cache(c, release, need_rcu_barrier);
+		shutdown_cache(c);
 
 	list_splice(&busy, &s->memcg_params.list);
 
@@ -694,8 +685,7 @@ static int shutdown_memcg_caches(struct kmem_cache *s,
 	return 0;
 }
 #else
-static inline int shutdown_memcg_caches(struct kmem_cache *s,
-		struct list_head *release, bool *need_rcu_barrier)
+static inline int shutdown_memcg_caches(struct kmem_cache *s)
 {
 	return 0;
 }
@@ -711,8 +701,6 @@ void slab_kmem_cache_release(struct kmem_cache *s)
 
 void kmem_cache_destroy(struct kmem_cache *s)
 {
-	LIST_HEAD(release);
-	bool need_rcu_barrier = false;
 	int err;
 
 	if (unlikely(!s))
@@ -728,9 +716,9 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (s->refcount)
 		goto out_unlock;
 
-	err = shutdown_memcg_caches(s, &release, &need_rcu_barrier);
+	err = shutdown_memcg_caches(s);
 	if (!err)
-		err = shutdown_cache(s, &release, &need_rcu_barrier);
+		err = shutdown_cache(s);
 
 	if (err) {
 		pr_err("kmem_cache_destroy %s: Slab cache still has objects\n",
@@ -742,8 +730,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
 
 	put_online_mems();
 	put_online_cpus();
-
-	release_caches(&release, need_rcu_barrier);
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
diff --git a/mm/slub.c b/mm/slub.c
index 68b84f9..2b78c82 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -214,11 +214,13 @@ enum track_item { TRACK_ALLOC, TRACK_FREE };
 static int sysfs_slab_add(struct kmem_cache *);
 static int sysfs_slab_alias(struct kmem_cache *, const char *);
 static void memcg_propagate_slab_attrs(struct kmem_cache *s);
+static void sysfs_slab_remove(struct kmem_cache *s);
 #else
 static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
 static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
 							{ return 0; }
 static inline void memcg_propagate_slab_attrs(struct kmem_cache *s) { }
+static inline void sysfs_slab_remove(struct kmem_cache *s) { }
 #endif
 
 static inline void stat(const struct kmem_cache *s, enum stat_item si)
@@ -3679,6 +3681,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s)
 		if (n->nr_partial || slabs_node(s, node))
 			return 1;
 	}
+	sysfs_slab_remove(s);
 	return 0;
 }
 
@@ -5629,7 +5632,7 @@ static int sysfs_slab_add(struct kmem_cache *s)
 	goto out;
 }
 
-void sysfs_slab_remove(struct kmem_cache *s)
+static void sysfs_slab_remove(struct kmem_cache *s)
 {
 	if (slab_state < FULL)
 		/*
@@ -5643,6 +5646,10 @@ void sysfs_slab_remove(struct kmem_cache *s)
 #endif
 	kobject_uevent(&s->kobj, KOBJ_REMOVE);
 	kobject_del(&s->kobj);
+}
+
+void sysfs_slab_release(struct kmem_cache *s)
+{
 	kobject_put(&s->kobj);
 }
 
-- 
2.9.3


* [PATCH 3/8] slab: reorganize memcg_cache_params
  2017-01-14 18:48 [PATCHSET v2] slab: make memcg slab destruction scalable Tejun Heo
  2017-01-14 18:48 ` [PATCH 1/8] Revert "slub: move synchronize_sched out of slab_mutex on shrink" Tejun Heo
  2017-01-14 18:48 ` [PATCH 2/8] slab: remove synchronous rcu_barrier() call in memcg cache release path Tejun Heo
@ 2017-01-14 18:48 ` Tejun Heo
  2017-01-14 18:48 ` [PATCH 4/8] slab: link memcg kmem_caches on their associated memory cgroup Tejun Heo
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-01-14 18:48 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

We're gonna change how memcg caches are iterated.  In preparation,
clean up and reorganize memcg_cache_params.

* The shared ->list is replaced by ->children in root and
  ->children_node in children.

* ->is_root_cache is removed.  Instead, ->root_cache is moved out of
  the child union and is now used by both root and children: NULL
  indicates a root cache, non-NULL a memcg one.

This patch doesn't cause any observable behavior changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/slab.h | 33 ++++++++++++++++++++++++---------
 mm/slab.h            |  6 +++---
 mm/slab_common.c     | 25 +++++++++++++------------
 3 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 084b12b..2e83922 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -545,22 +545,37 @@ struct memcg_cache_array {
  * array to be accessed without taking any locks, on relocation we free the old
  * version only after a grace period.
  *
- * Child caches will hold extra metadata needed for its operation. Fields are:
+ * Root and child caches hold different metadata.
  *
- * @memcg: pointer to the memcg this cache belongs to
- * @root_cache: pointer to the global, root cache, this cache was derived from
+ * @root_cache:	Common to root and child caches.  NULL for root, pointer to
+ *		the root cache for children.
  *
- * Both root and child caches of the same kind are linked into a list chained
- * through @list.
+ * The following fields are specific to root caches.
+ *
+ * @memcg_caches: kmemcg ID indexed table of child caches.  This table is
+ *		used to index child cachces during allocation and cleared
+ *		early during shutdown.
+ *
+ * @children:	List of all child caches.  While the child caches are also
+ *		reachable through @memcg_caches, a child cache remains on
+ *		this list until it is actually destroyed.
+ *
+ * The following fields are specific to child caches.
+ *
+ * @memcg:	Pointer to the memcg this cache belongs to.
+ *
+ * @children_node: List node for @root_cache->children list.
  */
 struct memcg_cache_params {
-	bool is_root_cache;
-	struct list_head list;
+	struct kmem_cache *root_cache;
 	union {
-		struct memcg_cache_array __rcu *memcg_caches;
+		struct {
+			struct memcg_cache_array __rcu *memcg_caches;
+			struct list_head children;
+		};
 		struct {
 			struct mem_cgroup *memcg;
-			struct kmem_cache *root_cache;
+			struct list_head children_node;
 		};
 	};
 };
diff --git a/mm/slab.h b/mm/slab.h
index 3fa2d77..9b3ec3f 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -203,12 +203,12 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
  * slab_mutex.
  */
 #define for_each_memcg_cache(iter, root) \
-	list_for_each_entry(iter, &(root)->memcg_params.list, \
-			    memcg_params.list)
+	list_for_each_entry(iter, &(root)->memcg_params.children, \
+			    memcg_params.children_node)
 
 static inline bool is_root_cache(struct kmem_cache *s)
 {
-	return s->memcg_params.is_root_cache;
+	return !s->memcg_params.root_cache;
 }
 
 static inline bool slab_equal_or_root(struct kmem_cache *s,
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 851c75e..ad9cc55 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -135,9 +135,9 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
 void slab_init_memcg_params(struct kmem_cache *s)
 {
-	s->memcg_params.is_root_cache = true;
-	INIT_LIST_HEAD(&s->memcg_params.list);
+	s->memcg_params.root_cache = NULL;
 	RCU_INIT_POINTER(s->memcg_params.memcg_caches, NULL);
+	INIT_LIST_HEAD(&s->memcg_params.children);
 }
 
 static int init_memcg_params(struct kmem_cache *s,
@@ -145,10 +145,10 @@ static int init_memcg_params(struct kmem_cache *s,
 {
 	struct memcg_cache_array *arr;
 
-	if (memcg) {
-		s->memcg_params.is_root_cache = false;
-		s->memcg_params.memcg = memcg;
+	if (root_cache) {
 		s->memcg_params.root_cache = root_cache;
+		s->memcg_params.memcg = memcg;
+		INIT_LIST_HEAD(&s->memcg_params.children_node);
 		return 0;
 	}
 
@@ -218,7 +218,7 @@ int memcg_update_all_caches(int num_memcgs)
 
 static void unlink_memcg_cache(struct kmem_cache *s)
 {
-	list_del(&s->memcg_params.list);
+	list_del(&s->memcg_params.children_node);
 }
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
@@ -559,7 +559,8 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 		goto out_unlock;
 	}
 
-	list_add(&s->memcg_params.list, &root_cache->memcg_params.list);
+	list_add(&s->memcg_params.children_node,
+		 &root_cache->memcg_params.children);
 
 	/*
 	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
@@ -655,7 +656,7 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
 			 * list so as not to try to destroy it for a second
 			 * time while iterating over inactive caches below.
 			 */
-			list_move(&c->memcg_params.list, &busy);
+			list_move(&c->memcg_params.children_node, &busy);
 		else
 			/*
 			 * The cache is empty and will be destroyed soon. Clear
@@ -670,17 +671,17 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
 	 * Second, shutdown all caches left from memory cgroups that are now
 	 * offline.
 	 */
-	list_for_each_entry_safe(c, c2, &s->memcg_params.list,
-				 memcg_params.list)
+	list_for_each_entry_safe(c, c2, &s->memcg_params.children,
+				 memcg_params.children_node)
 		shutdown_cache(c);
 
-	list_splice(&busy, &s->memcg_params.list);
+	list_splice(&busy, &s->memcg_params.children);
 
 	/*
 	 * A cache being destroyed must be empty. In particular, this means
 	 * that all per memcg caches attached to it must be empty too.
 	 */
-	if (!list_empty(&s->memcg_params.list))
+	if (!list_empty(&s->memcg_params.children))
 		return -EBUSY;
 	return 0;
 }
-- 
2.9.3


* [PATCH 4/8] slab: link memcg kmem_caches on their associated memory cgroup
  2017-01-14 18:48 [PATCHSET v2] slab: make memcg slab destruction scalable Tejun Heo
                   ` (2 preceding siblings ...)
  2017-01-14 18:48 ` [PATCH 3/8] slab: reorganize memcg_cache_params Tejun Heo
@ 2017-01-14 18:48 ` Tejun Heo
  2017-01-14 18:48 ` [PATCH 5/8] slab: implement slab_root_caches list Tejun Heo
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-01-14 18:48 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

While a memcg kmem_cache is listed on its root cache's ->children
list, there is no direct way to iterate all kmem_caches which are
associated with a memory cgroup.  The only way to iterate them is
walking all caches while filtering out caches which don't match, which
would be most of them.

This makes memcg destruction operations O(N^2) where N is the total
number of slab caches which can be huge.  This combined with the
synchronous RCU operations can tie up a CPU and affect the whole
machine for many hours when memory reclaim triggers offlining and
destruction of the stale memcgs.

This patch adds mem_cgroup->kmem_caches list which goes through
memcg_cache_params->kmem_caches_node of all kmem_caches which are
associated with the memcg.  All memcg specific iterations, including
stat file access, are updated to use the new list instead.

v2: Initial version made slab_{start|next|stop}() static; however,
    mm/slab.c still needs them.  Leave them global.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/memcontrol.h |  1 +
 include/linux/slab.h       |  3 +++
 mm/memcontrol.c            |  7 ++++---
 mm/slab.h                  |  3 +++
 mm/slab_common.c           | 36 +++++++++++++++++++++++++++++-------
 5 files changed, 40 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 61d20c1..4de925c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -253,6 +253,7 @@ struct mem_cgroup {
         /* Index in the kmem_cache->memcg_params.memcg_caches array */
 	int kmemcg_id;
 	enum memcg_kmem_state kmem_state;
+	struct list_head kmem_caches;
 #endif
 
 	int last_scanned_node;
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 2e83922..95b4d9d 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -565,6 +565,8 @@ struct memcg_cache_array {
  * @memcg:	Pointer to the memcg this cache belongs to.
  *
  * @children_node: List node for @root_cache->children list.
+ *
+ * @kmem_caches_node: List node for @memcg->kmem_caches list.
  */
 struct memcg_cache_params {
 	struct kmem_cache *root_cache;
@@ -576,6 +578,7 @@ struct memcg_cache_params {
 		struct {
 			struct mem_cgroup *memcg;
 			struct list_head children_node;
+			struct list_head kmem_caches_node;
 		};
 	};
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4048897..a2b20f7f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2839,6 +2839,7 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
 	 */
 	memcg->kmemcg_id = memcg_id;
 	memcg->kmem_state = KMEM_ONLINE;
+	INIT_LIST_HEAD(&memcg->kmem_caches);
 
 	return 0;
 }
@@ -4004,9 +4005,9 @@ static struct cftype mem_cgroup_legacy_files[] = {
 #ifdef CONFIG_SLABINFO
 	{
 		.name = "kmem.slabinfo",
-		.seq_start = slab_start,
-		.seq_next = slab_next,
-		.seq_stop = slab_stop,
+		.seq_start = memcg_slab_start,
+		.seq_next = memcg_slab_next,
+		.seq_stop = memcg_slab_stop,
 		.seq_show = memcg_slab_show,
 	},
 #endif
diff --git a/mm/slab.h b/mm/slab.h
index 9b3ec3f..ac9575d 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -491,6 +491,9 @@ static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 void *slab_start(struct seq_file *m, loff_t *pos);
 void *slab_next(struct seq_file *m, void *p, loff_t *pos);
 void slab_stop(struct seq_file *m, void *p);
+void *memcg_slab_start(struct seq_file *m, loff_t *pos);
+void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos);
+void memcg_slab_stop(struct seq_file *m, void *p);
 int memcg_slab_show(struct seq_file *m, void *p);
 
 void ___cache_free(struct kmem_cache *cache, void *x, unsigned long addr);
diff --git a/mm/slab_common.c b/mm/slab_common.c
index ad9cc55..0382d41 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -149,6 +149,7 @@ static int init_memcg_params(struct kmem_cache *s,
 		s->memcg_params.root_cache = root_cache;
 		s->memcg_params.memcg = memcg;
 		INIT_LIST_HEAD(&s->memcg_params.children_node);
+		INIT_LIST_HEAD(&s->memcg_params.kmem_caches_node);
 		return 0;
 	}
 
@@ -219,6 +220,7 @@ int memcg_update_all_caches(int num_memcgs)
 static void unlink_memcg_cache(struct kmem_cache *s)
 {
 	list_del(&s->memcg_params.children_node);
+	list_del(&s->memcg_params.kmem_caches_node);
 }
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
@@ -561,6 +563,7 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 
 	list_add(&s->memcg_params.children_node,
 		 &root_cache->memcg_params.children);
+	list_add(&s->memcg_params.kmem_caches_node, &memcg->kmem_caches);
 
 	/*
 	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
@@ -616,9 +619,8 @@ void memcg_destroy_kmem_caches(struct mem_cgroup *memcg)
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry_safe(s, s2, &slab_caches, list) {
-		if (is_root_cache(s) || s->memcg_params.memcg != memcg)
-			continue;
+	list_for_each_entry_safe(s, s2, &memcg->kmem_caches,
+				 memcg_params.kmem_caches_node) {
 		/*
 		 * The cgroup is about to be freed and therefore has no charges
 		 * left. Hence, all its caches must be empty by now.
@@ -1169,15 +1171,35 @@ static int slab_show(struct seq_file *m, void *p)
 }
 
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+void *memcg_slab_start(struct seq_file *m, loff_t *pos)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	mutex_lock(&slab_mutex);
+	return seq_list_start(&memcg->kmem_caches, *pos);
+}
+
+void *memcg_slab_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
+
+	return seq_list_next(p, &memcg->kmem_caches, pos);
+}
+
+void memcg_slab_stop(struct seq_file *m, void *p)
+{
+	mutex_unlock(&slab_mutex);
+}
+
 int memcg_slab_show(struct seq_file *m, void *p)
 {
-	struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
+	struct kmem_cache *s = list_entry(p, struct kmem_cache,
+					  memcg_params.kmem_caches_node);
 	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
 
-	if (p == slab_caches.next)
+	if (p == memcg->kmem_caches.next)
 		print_slabinfo_header(m);
-	if (!is_root_cache(s) && s->memcg_params.memcg == memcg)
-		cache_show(s, m);
+	cache_show(s, m);
 	return 0;
 }
 #endif
-- 
2.9.3


* [PATCH 5/8] slab: implement slab_root_caches list
  2017-01-14 18:48 [PATCHSET v2] slab: make memcg slab destruction scalable Tejun Heo
                   ` (3 preceding siblings ...)
  2017-01-14 18:48 ` [PATCH 4/8] slab: link memcg kmem_caches on their associated memory cgroup Tejun Heo
@ 2017-01-14 18:48 ` Tejun Heo
  2017-01-14 18:48 ` [PATCH 6/8] slab: introduce __kmemcg_cache_deactivate() Tejun Heo
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-01-14 18:48 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

slab_caches currently lists all caches including root and memcg ones.
This is the only data structure which lists the root caches and
iterating root caches can only be done by walking the list while
skipping over memcg caches.  As there can be a huge number of memcg
caches, this can become very expensive.

This can also make /proc/slabinfo behave very badly.  seq_file
processes reads in 4k chunks and seeks to the previous Nth position on
the slab_caches list to resume after each chunk.  With a lot of memcg
cache churn on the list, reading /proc/slabinfo can become very slow
and its content often ends up with duplicate and/or missing entries.

This patch adds a new list slab_root_caches which lists only the root
caches.  When memcg is not enabled, it becomes just an alias of
slab_caches.  memcg specific list operations are collected into
memcg_[un]link_cache().

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/slab.h |  3 +++
 mm/slab.h            | 15 +++++++++++++
 mm/slab_common.c     | 59 ++++++++++++++++++++++++++++++----------------------
 mm/slub.c            |  1 +
 4 files changed, 53 insertions(+), 25 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 95b4d9d..41c49cc 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -556,6 +556,8 @@ struct memcg_cache_array {
  *		used to index child cachces during allocation and cleared
  *		early during shutdown.
  *
+ * @root_caches_node: List node for slab_root_caches list.
+ *
  * @children:	List of all child caches.  While the child caches are also
  *		reachable through @memcg_caches, a child cache remains on
  *		this list until it is actually destroyed.
@@ -573,6 +575,7 @@ struct memcg_cache_params {
 	union {
 		struct {
 			struct memcg_cache_array __rcu *memcg_caches;
+			struct list_head __root_caches_node;
 			struct list_head children;
 		};
 		struct {
diff --git a/mm/slab.h b/mm/slab.h
index ac9575d..d4d9270 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -198,6 +198,11 @@ void __kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
 int __kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
 
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+
+/* List of all root caches. */
+extern struct list_head		slab_root_caches;
+#define root_caches_node	memcg_params.__root_caches_node
+
 /*
  * Iterate over all memcg caches of the given root cache. The caller must hold
  * slab_mutex.
@@ -297,9 +302,14 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 }
 
 extern void slab_init_memcg_params(struct kmem_cache *);
+extern void memcg_link_cache(struct kmem_cache *s);
 
 #else /* CONFIG_MEMCG && !CONFIG_SLOB */
 
+/* If !memcg, all caches are root. */
+#define slab_root_caches	slab_caches
+#define root_caches_node	list
+
 #define for_each_memcg_cache(iter, root) \
 	for ((void)(iter), (void)(root); 0; )
 
@@ -344,6 +354,11 @@ static inline void memcg_uncharge_slab(struct page *page, int order,
 static inline void slab_init_memcg_params(struct kmem_cache *s)
 {
 }
+
+static inline void memcg_link_cache(struct kmem_cache *s)
+{
+}
+
 #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
 
 static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 0382d41..bae6a63 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -133,6 +133,9 @@ int __kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t nr,
 }
 
 #if defined(CONFIG_MEMCG) && !defined(CONFIG_SLOB)
+
+LIST_HEAD(slab_root_caches);
+
 void slab_init_memcg_params(struct kmem_cache *s)
 {
 	s->memcg_params.root_cache = NULL;
@@ -178,9 +181,6 @@ static int update_memcg_params(struct kmem_cache *s, int new_array_size)
 {
 	struct memcg_cache_array *old, *new;
 
-	if (!is_root_cache(s))
-		return 0;
-
 	new = kzalloc(sizeof(struct memcg_cache_array) +
 		      new_array_size * sizeof(void *), GFP_KERNEL);
 	if (!new)
@@ -204,7 +204,7 @@ int memcg_update_all_caches(int num_memcgs)
 	int ret = 0;
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list) {
+	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
 		ret = update_memcg_params(s, num_memcgs);
 		/*
 		 * Instead of freeing the memory, we'll just leave the caches
@@ -217,10 +217,26 @@ int memcg_update_all_caches(int num_memcgs)
 	return ret;
 }
 
-static void unlink_memcg_cache(struct kmem_cache *s)
+void memcg_link_cache(struct kmem_cache *s)
+{
+	if (is_root_cache(s)) {
+		list_add(&s->root_caches_node, &slab_root_caches);
+	} else {
+		list_add(&s->memcg_params.children_node,
+			 &s->memcg_params.root_cache->memcg_params.children);
+		list_add(&s->memcg_params.kmem_caches_node,
+			 &s->memcg_params.memcg->kmem_caches);
+	}
+}
+
+static void memcg_unlink_cache(struct kmem_cache *s)
 {
-	list_del(&s->memcg_params.children_node);
-	list_del(&s->memcg_params.kmem_caches_node);
+	if (is_root_cache(s)) {
+		list_del(&s->root_caches_node);
+	} else {
+		list_del(&s->memcg_params.children_node);
+		list_del(&s->memcg_params.kmem_caches_node);
+	}
 }
 #else
 static inline int init_memcg_params(struct kmem_cache *s,
@@ -233,7 +249,7 @@ static inline void destroy_memcg_params(struct kmem_cache *s)
 {
 }
 
-static inline void unlink_memcg_cache(struct kmem_cache *s)
+static inline void memcg_unlink_cache(struct kmem_cache *s)
 {
 }
 #endif /* CONFIG_MEMCG && !CONFIG_SLOB */
@@ -277,7 +293,7 @@ struct kmem_cache *find_mergeable(size_t size, size_t align,
 	size = ALIGN(size, align);
 	flags = kmem_cache_flags(size, flags, name, NULL);
 
-	list_for_each_entry_reverse(s, &slab_caches, list) {
+	list_for_each_entry_reverse(s, &slab_root_caches, root_caches_node) {
 		if (slab_unmergeable(s))
 			continue;
 
@@ -361,6 +377,7 @@ static struct kmem_cache *create_cache(const char *name,
 
 	s->refcount = 1;
 	list_add(&s->list, &slab_caches);
+	memcg_link_cache(s);
 out:
 	if (err)
 		return ERR_PTR(err);
@@ -485,9 +502,8 @@ static int shutdown_cache(struct kmem_cache *s)
 	if (__kmem_cache_shutdown(s) != 0)
 		return -EBUSY;
 
+	memcg_unlink_cache(s);
 	list_del(&s->list);
-	if (!is_root_cache(s))
-		unlink_memcg_cache(s);
 
 	if (s->flags & SLAB_DESTROY_BY_RCU)
 		call_rcu(&s->rcu_head, slab_kmem_cache_release_rcufn);
@@ -561,10 +577,6 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 		goto out_unlock;
 	}
 
-	list_add(&s->memcg_params.children_node,
-		 &root_cache->memcg_params.children);
-	list_add(&s->memcg_params.kmem_caches_node, &memcg->kmem_caches);
-
 	/*
 	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
 	 * barrier here to ensure nobody will see the kmem_cache partially
@@ -592,10 +604,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 	get_online_mems();
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list) {
-		if (!is_root_cache(s))
-			continue;
-
+	list_for_each_entry(s, &slab_root_caches, root_caches_node) {
 		arr = rcu_dereference_protected(s->memcg_params.memcg_caches,
 						lockdep_is_held(&slab_mutex));
 		c = arr->entries[idx];
@@ -794,6 +803,7 @@ struct kmem_cache *__init create_kmalloc_cache(const char *name, size_t size,
 
 	create_boot_cache(s, name, size, flags);
 	list_add(&s->list, &slab_caches);
+	memcg_link_cache(s);
 	s->refcount = 1;
 	return s;
 }
@@ -1104,12 +1114,12 @@ static void print_slabinfo_header(struct seq_file *m)
 void *slab_start(struct seq_file *m, loff_t *pos)
 {
 	mutex_lock(&slab_mutex);
-	return seq_list_start(&slab_caches, *pos);
+	return seq_list_start(&slab_root_caches, *pos);
 }
 
 void *slab_next(struct seq_file *m, void *p, loff_t *pos)
 {
-	return seq_list_next(p, &slab_caches, pos);
+	return seq_list_next(p, &slab_root_caches, pos);
 }
 
 void slab_stop(struct seq_file *m, void *p)
@@ -1161,12 +1171,11 @@ static void cache_show(struct kmem_cache *s, struct seq_file *m)
 
 static int slab_show(struct seq_file *m, void *p)
 {
-	struct kmem_cache *s = list_entry(p, struct kmem_cache, list);
+	struct kmem_cache *s = list_entry(p, struct kmem_cache, root_caches_node);
 
-	if (p == slab_caches.next)
+	if (p == slab_root_caches.next)
 		print_slabinfo_header(m);
-	if (is_root_cache(s))
-		cache_show(s, m);
+	cache_show(s, m);
 	return 0;
 }
 
diff --git a/mm/slub.c b/mm/slub.c
index 2b78c82..8f37896 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4119,6 +4119,7 @@ static struct kmem_cache * __init bootstrap(struct kmem_cache *static_cache)
 	}
 	slab_init_memcg_params(s);
 	list_add(&s->list, &slab_caches);
+	memcg_link_cache(s);
 	return s;
 }
 
-- 
2.9.3


* [PATCH 6/8] slab: introduce __kmemcg_cache_deactivate()
  2017-01-14 18:48 [PATCHSET v2] slab: make memcg slab destruction scalable Tejun Heo
                   ` (4 preceding siblings ...)
  2017-01-14 18:48 ` [PATCH 5/8] slab: implement slab_root_caches list Tejun Heo
@ 2017-01-14 18:48 ` Tejun Heo
  2017-01-14 18:48 ` [PATCH 7/8] slab: remove synchronous synchronize_sched() from memcg cache deactivation path Tejun Heo
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-01-14 18:48 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

__kmem_cache_shrink() is called with %true @deactivate only for memcg
caches.  Remove @deactivate from __kmem_cache_shrink() and introduce
__kmemcg_cache_deactivate() instead.  Each memcg-supporting allocator
should implement it and it should deactivate and drain the cache.

This is to allow memcg cache deactivation behavior to further deviate
from simple shrinking without messing up __kmem_cache_shrink().

This is pure reorganization and doesn't introduce any observable
behavior changes.

v2: Dropped unnecessary ifdef in mm/slab.h as suggested by Vladimir.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/slab.c        | 11 +++++++++--
 mm/slab.h        |  3 ++-
 mm/slab_common.c |  4 ++--
 mm/slob.c        |  2 +-
 mm/slub.c        | 39 ++++++++++++++++++++++-----------------
 5 files changed, 36 insertions(+), 23 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 767e8e4..65814f2 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2314,7 +2314,7 @@ static int drain_freelist(struct kmem_cache *cache,
 	return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *cachep)
 {
 	int ret = 0;
 	int node;
@@ -2332,9 +2332,16 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 	return (ret ? 1 : 0);
 }
 
+#ifdef CONFIG_MEMCG
+void __kmemcg_cache_deactivate(struct kmem_cache *cachep)
+{
+	__kmem_cache_shrink(cachep);
+}
+#endif
+
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-	return __kmem_cache_shrink(cachep, false);
+	return __kmem_cache_shrink(cachep);
 }
 
 void __kmem_cache_release(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index d4d9270..0946d97 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -164,7 +164,8 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *, bool);
+int __kmem_cache_shrink(struct kmem_cache *);
+void __kmemcg_cache_deactivate(struct kmem_cache *s);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index bae6a63..cd81cad 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -611,7 +611,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 		if (!c)
 			continue;
 
-		__kmem_cache_shrink(c, true);
+		__kmemcg_cache_deactivate(c);
 		arr->entries[idx] = NULL;
 	}
 	mutex_unlock(&slab_mutex);
@@ -759,7 +759,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 	get_online_cpus();
 	get_online_mems();
 	kasan_cache_shrink(cachep);
-	ret = __kmem_cache_shrink(cachep, false);
+	ret = __kmem_cache_shrink(cachep);
 	put_online_mems();
 	put_online_cpus();
 	return ret;
diff --git a/mm/slob.c b/mm/slob.c
index 5ec1580..eac04d4 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -634,7 +634,7 @@ void __kmem_cache_release(struct kmem_cache *c)
 {
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *d)
 {
 	return 0;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 8f37896..c754ea0 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3886,7 +3886,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *s)
 {
 	int node;
 	int i;
@@ -3898,21 +3898,6 @@ int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 	unsigned long flags;
 	int ret = 0;
 
-	if (deactivate) {
-		/*
-		 * Disable empty slabs caching. Used to avoid pinning offline
-		 * memory cgroups by kmem pages that can be freed.
-		 */
-		s->cpu_partial = 0;
-		s->min_partial = 0;
-
-		/*
-		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
-		 * so we have to make sure the change is visible.
-		 */
-		synchronize_sched();
-	}
-
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
 		INIT_LIST_HEAD(&discard);
@@ -3963,13 +3948,33 @@ int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 	return ret;
 }
 
+#ifdef CONFIG_MEMCG
+void __kmemcg_cache_deactivate(struct kmem_cache *s)
+{
+	/*
+	 * Disable empty slabs caching. Used to avoid pinning offline
+	 * memory cgroups by kmem pages that can be freed.
+	 */
+	s->cpu_partial = 0;
+	s->min_partial = 0;
+
+	/*
+	 * s->cpu_partial is checked locklessly (see put_cpu_partial), so
+	 * we have to make sure the change is visible.
+	 */
+	synchronize_sched();
+
+	__kmem_cache_shrink(s);
+}
+#endif
+
 static int slab_mem_going_offline_callback(void *arg)
 {
 	struct kmem_cache *s;
 
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list)
-		__kmem_cache_shrink(s, false);
+		__kmem_cache_shrink(s);
 	mutex_unlock(&slab_mutex);
 
 	return 0;
-- 
2.9.3


* [PATCH 7/8] slab: remove synchronous synchronize_sched() from memcg cache deactivation path
  2017-01-14 18:48 [PATCHSET v2] slab: make memcg slab destruction scalable Tejun Heo
                   ` (5 preceding siblings ...)
  2017-01-14 18:48 ` [PATCH 6/8] slab: introduce __kmemcg_cache_deactivate() Tejun Heo
@ 2017-01-14 18:48 ` Tejun Heo
  2017-01-17  0:26   ` Joonsoo Kim
  2017-01-14 18:48 ` [PATCH 8/8] slab: remove slub sysfs interface files early for empty memcg caches Tejun Heo
  2017-01-17  0:12 ` [PATCHSET v2] slab: make memcg slab destruction scalable Joonsoo Kim
  8 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2017-01-14 18:48 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

slub uses synchronize_sched() to deactivate a memcg cache.
synchronize_sched() is an expensive and slow operation and doesn't
scale when a huge number of caches are destroyed back-to-back.  While
there used to be a simple batching mechanism, the batching was too
restricted to be helpful.

This patch implements slab_deactivate_memcg_cache_rcu_sched() which
slub can use to schedule a sched RCU callback instead of performing
synchronize_sched() synchronously while holding cgroup_mutex.  While
this adds online cpus, mems and slab_mutex operations, taking these
locks back-to-back from the same kworker (which is what's gonna happen
when there are many caches to deactivate) isn't expensive at all, and
this gets rid of the scalability problem completely.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 include/linux/slab.h |  6 ++++++
 mm/slab.h            |  2 ++
 mm/slab_common.c     | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 mm/slub.c            | 12 +++++++----
 4 files changed, 76 insertions(+), 4 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 41c49cc..5ca8778 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -582,6 +582,12 @@ struct memcg_cache_params {
 			struct mem_cgroup *memcg;
 			struct list_head children_node;
 			struct list_head kmem_caches_node;
+
+			void (*deact_fn)(struct kmem_cache *);
+			union {
+				struct rcu_head deact_rcu_head;
+				struct work_struct deact_work;
+			};
 		};
 	};
 };
diff --git a/mm/slab.h b/mm/slab.h
index 0946d97..2fe07d7 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -304,6 +304,8 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
 
 extern void slab_init_memcg_params(struct kmem_cache *);
 extern void memcg_link_cache(struct kmem_cache *s);
+extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+				void (*deact_fn)(struct kmem_cache *));
 
 #else /* CONFIG_MEMCG && !CONFIG_SLOB */
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index cd81cad..4a0605c 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -592,6 +592,66 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
 	put_online_cpus();
 }
 
+static void kmemcg_deactivate_workfn(struct work_struct *work)
+{
+	struct kmem_cache *s = container_of(work, struct kmem_cache,
+					    memcg_params.deact_work);
+
+	get_online_cpus();
+	get_online_mems();
+
+	mutex_lock(&slab_mutex);
+
+	s->memcg_params.deact_fn(s);
+
+	mutex_unlock(&slab_mutex);
+
+	put_online_mems();
+	put_online_cpus();
+
+	/* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
+	css_put(&s->memcg_params.memcg->css);
+}
+
+static void kmemcg_deactivate_rcufn(struct rcu_head *head)
+{
+	struct kmem_cache *s = container_of(head, struct kmem_cache,
+					    memcg_params.deact_rcu_head);
+
+	/*
+	 * We need to grab blocking locks.  Bounce to ->deact_work.  The
+	 * work item shares the space with the RCU head and can't be
+	 * initialized earlier.
+	 */
+	INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
+	schedule_work(&s->memcg_params.deact_work);
+}
+
+/**
+ * slab_deactivate_memcg_cache_rcu_sched - schedule deactivation after a
+ *					   sched RCU grace period
+ * @s: target kmem_cache
+ * @deact_fn: deactivation function to call
+ *
+ * Schedule @deact_fn to be invoked with online cpus, mems and slab_mutex
+ * held after a sched RCU grace period.  The slab is guaranteed to stay
+ * alive until @deact_fn is finished.  This is to be used from
+ * __kmemcg_cache_deactivate().
+ */
+void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
+					   void (*deact_fn)(struct kmem_cache *))
+{
+	if (WARN_ON_ONCE(is_root_cache(s)) ||
+	    WARN_ON_ONCE(s->memcg_params.deact_fn))
+		return;
+
+	/* pin memcg so that @s doesn't get destroyed in the middle */
+	css_get(&s->memcg_params.memcg->css);
+
+	s->memcg_params.deact_fn = deact_fn;
+	call_rcu_sched(&s->memcg_params.deact_rcu_head, kmemcg_deactivate_rcufn);
+}
+
 void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 {
 	int idx;
diff --git a/mm/slub.c b/mm/slub.c
index c754ea0..184f80b 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3949,6 +3949,12 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 }
 
 #ifdef CONFIG_MEMCG
+static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
+{
+	/* called with all the locks held after a sched RCU grace period */
+	__kmem_cache_shrink(s);
+}
+
 void __kmemcg_cache_deactivate(struct kmem_cache *s)
 {
 	/*
@@ -3960,11 +3966,9 @@ void __kmemcg_cache_deactivate(struct kmem_cache *s)
 
 	/*
 	 * s->cpu_partial is checked locklessly (see put_cpu_partial), so
-	 * we have to make sure the change is visible.
+	 * we have to make sure the change is visible before shrinking.
 	 */
-	synchronize_sched();
-
-	__kmem_cache_shrink(s);
+	slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
 }
 #endif
 
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 8/8] slab: remove slub sysfs interface files early for empty memcg caches
  2017-01-14 18:48 [PATCHSET v2] slab: make memcg slab destruction scalable Tejun Heo
                   ` (6 preceding siblings ...)
  2017-01-14 18:48 ` [PATCH 7/8] slab: remove synchronous synchronize_sched() from memcg cache deactivation path Tejun Heo
@ 2017-01-14 18:48 ` Tejun Heo
  2017-01-17  0:12 ` [PATCHSET v2] slab: make memcg slab destruction scalable Joonsoo Kim
  8 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-01-14 18:48 UTC (permalink / raw)
  To: vdavydov.dev, cl, penberg, rientjes, iamjoonsoo.kim, akpm
  Cc: jsvana, hannes, linux-kernel, linux-mm, cgroups, kernel-team, Tejun Heo

With kmem cgroup support enabled, kmem_caches can be created and
destroyed frequently and a great number of near empty kmem_caches can
accumulate if there are a lot of transient cgroups and the system is
not under memory pressure.  When memory reclaim starts under such
conditions, it can lead to consecutive deactivation and destruction of
many kmem_caches, easily hundreds of thousands on moderately large
systems, exposing scalability issues in the current slab management
code.  This is one of the patches to address the issue.

Each cache has a number of sysfs interface files under
/sys/kernel/slab.  On a system with a lot of memory and transient
memcgs, the number of interface files which have to be removed once
memory reclaim kicks in can reach millions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Jay Vana <jsvana@fb.com>
Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
 mm/slub.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 184f80b..5bffa1f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3951,8 +3951,20 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 #ifdef CONFIG_MEMCG
 static void kmemcg_cache_deact_after_rcu(struct kmem_cache *s)
 {
-	/* called with all the locks held after a sched RCU grace period */
-	__kmem_cache_shrink(s);
+	/*
+	 * Called with all the locks held after a sched RCU grace period.
+	 * Even if @s becomes empty after shrinking, we can't know that @s
+	 * doesn't have allocations already in-flight and thus can't
+	 * destroy @s until the associated memcg is released.
+	 *
+	 * However, let's remove the sysfs files for empty caches here.
+	 * Each cache has a lot of interface files which aren't
+	 * particularly useful for empty draining caches; otherwise, we can
+	 * easily end up with millions of unnecessary sysfs files on
+	 * systems which have a lot of memory and transient cgroups.
+	 */
+	if (!__kmem_cache_shrink(s))
+		sysfs_slab_remove(s);
 }
 
 void __kmemcg_cache_deactivate(struct kmem_cache *s)
@@ -5651,6 +5663,15 @@ static void sysfs_slab_remove(struct kmem_cache *s)
 		 */
 		return;
 
+	if (!s->kobj.state_in_sysfs)
+		/*
+		 * For a memcg cache, this may be called during
+		 * deactivation and again on shutdown.  Remove only once.
+		 * A cache is never shut down before deactivation is
+		 * complete, so no need to worry about synchronization.
+		 */
+		return;
+
 #ifdef CONFIG_MEMCG
 	kset_unregister(s->memcg_kset);
 #endif
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2] slab: make memcg slab destruction scalable
  2017-01-14 18:48 [PATCHSET v2] slab: make memcg slab destruction scalable Tejun Heo
                   ` (7 preceding siblings ...)
  2017-01-14 18:48 ` [PATCH 8/8] slab: remove slub sysfs interface files early for empty memcg caches Tejun Heo
@ 2017-01-17  0:12 ` Joonsoo Kim
  2017-01-17 16:49   ` Tejun Heo
  8 siblings, 1 reply; 15+ messages in thread
From: Joonsoo Kim @ 2017-01-17  0:12 UTC (permalink / raw)
  To: Tejun Heo
  Cc: vdavydov.dev, cl, penberg, rientjes, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 01:48:26PM -0500, Tejun Heo wrote:
> This is v2.  Changes from the last version[L] are
> 
> * 0002-slab-remove-synchronous-rcu_barrier-call-in-memcg-ca.patch was
>   incorrect and dropped.
> 
> * 0006-slab-don-t-put-memcg-caches-on-slab_caches-list.patch
>   incorrectly converted places which needed to walk all caches.
>   Replaced with 0005-slab-implement-slab_root_caches-list.patch which
>   adds root-only list instead of converting slab_caches list to list
>   only root caches.
> 
> * Misc fixes.
> 
> With kmem cgroup support enabled, kmem_caches can be created and
> destroyed frequently and a great number of near empty kmem_caches can
> accumulate if there are a lot of transient cgroups and the system is
> not under memory pressure.  When memory reclaim starts under such
> conditions, it can lead to consecutive deactivation and destruction of
> many kmem_caches, easily hundreds of thousands on moderately large
> systems, exposing scalability issues in the current slab management
> code.
> 
> I've seen machines which end up with hundred thousands of caches and
> many millions of kernfs_nodes.  The current code is O(N^2) on the
> total number of caches and has synchronous rcu_barrier() and
> synchronize_sched() in cgroup offline / release path which is executed
> while holding cgroup_mutex.  Combined, this leads to very expensive
> and slow cache destruction operations which can easily keep running
> for half a day.
> 
> This also messes up /proc/slabinfo along with other cache iterating
> operations.  seq_file operates on 4k chunks and on each 4k boundary
> tries to seek to the last position in the list.  With a huge number of
> caches on the list, this becomes very slow and very prone to the list
> content changing underneath it leading to a lot of missing and/or
> duplicate entries.
> 
> This patchset addresses the scalability problem.
> 
> * Add root and per-memcg lists.  Update each user to use the
>   appropriate list.
> 
> * Replace rcu_barrier() and synchronize_rcu() with call_rcu() and
>   call_rcu_sched().
> 
> * For dying empty slub caches, remove the sysfs files after
>   deactivation so that we don't end up with millions of sysfs files
>   without any useful information on them.

Could you confirm that your series solves the problem reported by
Doug? It would be great if the result were mentioned in the patch
description.

https://bugzilla.kernel.org/show_bug.cgi?id=172991

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 7/8] slab: remove synchronous synchronize_sched() from memcg cache deactivation path
  2017-01-14 18:48 ` [PATCH 7/8] slab: remove synchronous synchronize_sched() from memcg cache deactivation path Tejun Heo
@ 2017-01-17  0:26   ` Joonsoo Kim
  2017-01-17 16:42     ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Joonsoo Kim @ 2017-01-17  0:26 UTC (permalink / raw)
  To: Tejun Heo
  Cc: vdavydov.dev, cl, penberg, rientjes, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Sat, Jan 14, 2017 at 01:48:33PM -0500, Tejun Heo wrote:
> With kmem cgroup support enabled, kmem_caches can be created and
> destroyed frequently and a great number of near empty kmem_caches can
> accumulate if there are a lot of transient cgroups and the system is
> not under memory pressure.  When memory reclaim starts under such
> conditions, it can lead to consecutive deactivation and destruction of
> many kmem_caches, easily hundreds of thousands on moderately large
> systems, exposing scalability issues in the current slab management
> code.  This is one of the patches to address the issue.
> 
> slub uses synchronize_sched() to deactivate a memcg cache.
> synchronize_sched() is an expensive and slow operation and doesn't
> scale when a huge number of caches are destroyed back-to-back.  While
> there used to be a simple batching mechanism, the batching was too
> restricted to be helpful.
> 
> This patch implements slab_deactivate_memcg_cache_rcu_sched() which
> slub can use to schedule a sched RCU callback instead of performing
> synchronize_sched() synchronously while holding cgroup_mutex.  While
> this adds online cpus, mems and slab_mutex operations, taking these
> locks back-to-back from the same kworker, which is what happens when
> there are many caches to deactivate, is not expensive at all, and it
> gets rid of the scalability problem completely.
> 
> Signed-off-by: Tejun Heo <tj@kernel.org>
> Reported-by: Jay Vana <jsvana@fb.com>
> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
>  include/linux/slab.h |  6 ++++++
>  mm/slab.h            |  2 ++
>  mm/slab_common.c     | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  mm/slub.c            | 12 +++++++----
>  4 files changed, 76 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/slab.h b/include/linux/slab.h
> index 41c49cc..5ca8778 100644
> --- a/include/linux/slab.h
> +++ b/include/linux/slab.h
> @@ -582,6 +582,12 @@ struct memcg_cache_params {
>  			struct mem_cgroup *memcg;
>  			struct list_head children_node;
>  			struct list_head kmem_caches_node;
> +
> +			void (*deact_fn)(struct kmem_cache *);
> +			union {
> +				struct rcu_head deact_rcu_head;
> +				struct work_struct deact_work;
> +			};
>  		};
>  	};
>  };
> diff --git a/mm/slab.h b/mm/slab.h
> index 0946d97..2fe07d7 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -304,6 +304,8 @@ static __always_inline void memcg_uncharge_slab(struct page *page, int order,
>  
>  extern void slab_init_memcg_params(struct kmem_cache *);
>  extern void memcg_link_cache(struct kmem_cache *s);
> +extern void slab_deactivate_memcg_cache_rcu_sched(struct kmem_cache *s,
> +				void (*deact_fn)(struct kmem_cache *));
>  
>  #else /* CONFIG_MEMCG && !CONFIG_SLOB */
>  
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index cd81cad..4a0605c 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -592,6 +592,66 @@ void memcg_create_kmem_cache(struct mem_cgroup *memcg,
>  	put_online_cpus();
>  }
>  
> +static void kmemcg_deactivate_workfn(struct work_struct *work)
> +{
> +	struct kmem_cache *s = container_of(work, struct kmem_cache,
> +					    memcg_params.deact_work);
> +
> +	get_online_cpus();
> +	get_online_mems();
> +
> +	mutex_lock(&slab_mutex);
> +
> +	s->memcg_params.deact_fn(s);
> +
> +	mutex_unlock(&slab_mutex);
> +
> +	put_online_mems();
> +	put_online_cpus();
> +
> +	/* done, put the ref from slab_deactivate_memcg_cache_rcu_sched() */
> +	css_put(&s->memcg_params.memcg->css);
> +}
> +
> +static void kmemcg_deactivate_rcufn(struct rcu_head *head)
> +{
> +	struct kmem_cache *s = container_of(head, struct kmem_cache,
> +					    memcg_params.deact_rcu_head);
> +
> +	/*
> +	 * We need to grab blocking locks.  Bounce to ->deact_work.  The
> +	 * work item shares the space with the RCU head and can't be
> +	 * initialized eariler.
> +	 */
> +	INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
> +	schedule_work(&s->memcg_params.deact_work);
> +}

Isn't it better to submit one work item for each memcg, as Vladimir
did? Or could you submit this work to the ordered workqueue? I'm not
a workqueue expert like you, but I think there is a chance of
creating a lot of threads if there is contention on slab_mutex.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 7/8] slab: remove synchronous synchronize_sched() from memcg cache deactivation path
  2017-01-17  0:26   ` Joonsoo Kim
@ 2017-01-17 16:42     ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-01-17 16:42 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: vdavydov.dev, cl, penberg, rientjes, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Tue, Jan 17, 2017 at 09:26:11AM +0900, Joonsoo Kim wrote:
> > +	INIT_WORK(&s->memcg_params.deact_work, kmemcg_deactivate_workfn);
> > +	schedule_work(&s->memcg_params.deact_work);
> > +}
> 
> Isn't it better to submit one work item for each memcg, as Vladimir
> did? Or could you submit this work to the ordered workqueue? I'm not
> a workqueue expert like you, but I think there is a chance of
> creating a lot of threads if there is contention on slab_mutex.

Yeah, good point.  I'll switch it to its own workqueue w/ concurrency
limited to one.
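
For illustration only, a minimal sketch of a dedicated workqueue with
concurrency limited to one.  The kmemcg_deact_wq name and the initcall are
assumptions, not code from this series; the actual queue name and flags are
up to the eventual patch.

#include <linux/workqueue.h>
#include <linux/init.h>
#include <linux/errno.h>

static struct workqueue_struct *kmemcg_deact_wq;

static int __init kmemcg_deact_wq_init(void)
{
	/*
	 * max_active == 1 limits in-flight work items to one (per CPU for a
	 * bound workqueue), so a burst of deactivations can't spawn a pile
	 * of kworkers that all block on slab_mutex.
	 */
	kmemcg_deact_wq = alloc_workqueue("kmemcg_deact", 0, 1);
	return kmemcg_deact_wq ? 0 : -ENOMEM;
}
subsys_initcall(kmemcg_deact_wq_init);

/* kmemcg_deactivate_rcufn() would then queue to the dedicated queue:	*/
/*	queue_work(kmemcg_deact_wq, &s->memcg_params.deact_work);	*/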

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2] slab: make memcg slab destruction scalable
  2017-01-17  0:12 ` [PATCHSET v2] slab: make memcg slab destruction scalable Joonsoo Kim
@ 2017-01-17 16:49   ` Tejun Heo
  2017-01-18  7:54     ` Joonsoo Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Tejun Heo @ 2017-01-17 16:49 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: vdavydov.dev, cl, penberg, rientjes, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

Hello,

On Tue, Jan 17, 2017 at 09:12:57AM +0900, Joonsoo Kim wrote:
> Could you confirm that your series solves the problem reported by
> Doug? It would be great if the result were mentioned in the patch
> description.
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=172991

So, that's an issue in the creation path which is already resolved by
switching to an ordered workqueue (it'd probably be better to use a
per-cpu wq w/ @max_active == 1 tho).  This patchset is about the
release path.  slab_mutex contention would definitely go down with
this, but I don't think there's more connection to it than that.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2] slab: make memcg slab destruction scalable
  2017-01-17 16:49   ` Tejun Heo
@ 2017-01-18  7:54     ` Joonsoo Kim
  2017-01-18 21:01       ` Tejun Heo
  0 siblings, 1 reply; 15+ messages in thread
From: Joonsoo Kim @ 2017-01-18  7:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: vdavydov.dev, cl, penberg, rientjes, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

On Tue, Jan 17, 2017 at 08:49:13AM -0800, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jan 17, 2017 at 09:12:57AM +0900, Joonsoo Kim wrote:
> > Could you confirm that your series solves the problem reported by
> > Doug? It would be great if the result were mentioned in the patch
> > description.
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=172991
> 
> So, that's an issue in the creation path which is already resolved by
> switching to an ordered workqueue (it'd probably be better to use a
> per-cpu wq w/ @max_active == 1 tho).  This patchset is about the
> release path.  slab_mutex contention would definitely go down with
> this, but I don't think there's more connection to it than that.

That problem is caused by the slow release path and the resulting
contention on slab_mutex. With an ordered workqueue, not many kworkers
would be created, but a lot of work items that create new caches for
memcgs could stay pending for a long time due to the slow release path.

Your patchset replaces the optimization for the release path, so it
would be better to check that the work isn't pending for a long time
in the above workload.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCHSET v2] slab: make memcg slab destruction scalable
  2017-01-18  7:54     ` Joonsoo Kim
@ 2017-01-18 21:01       ` Tejun Heo
  0 siblings, 0 replies; 15+ messages in thread
From: Tejun Heo @ 2017-01-18 21:01 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: vdavydov.dev, cl, penberg, rientjes, akpm, jsvana, hannes,
	linux-kernel, linux-mm, cgroups, kernel-team

Hello,

On Wed, Jan 18, 2017 at 04:54:48PM +0900, Joonsoo Kim wrote:
> That problem is caused by the slow release path and the resulting
> contention on slab_mutex. With an ordered workqueue, not many kworkers
> would be created, but a lot of work items that create new caches for
> memcgs could stay pending for a long time due to the slow release path.

How many work items are pending and how many workers are on them
shouldn't affect the actual completion time that much when most of
them are serialized by a mutex.  Anyways, this patchset moves all the
slow parts out of slab_mutex, so none of this is a problem anymore.

> Your patchset replaces optimization for release path so it's better to
> check that the work isn't pending for a long time in above workload.

Yeap, it seems to work fine.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2017-01-18 21:09 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-14 18:48 [PATCHSET v2] slab: make memcg slab destruction scalable Tejun Heo
2017-01-14 18:48 ` [PATCH 1/8] Revert "slub: move synchronize_sched out of slab_mutex on shrink" Tejun Heo
2017-01-14 18:48 ` [PATCH 2/8] slab: remove synchronous rcu_barrier() call in memcg cache release path Tejun Heo
2017-01-14 18:48 ` [PATCH 3/8] slab: reorganize memcg_cache_params Tejun Heo
2017-01-14 18:48 ` [PATCH 4/8] slab: link memcg kmem_caches on their associated memory cgroup Tejun Heo
2017-01-14 18:48 ` [PATCH 5/8] slab: implement slab_root_caches list Tejun Heo
2017-01-14 18:48 ` [PATCH 6/8] slab: introduce __kmemcg_cache_deactivate() Tejun Heo
2017-01-14 18:48 ` [PATCH 7/8] slab: remove synchronous synchronize_sched() from memcg cache deactivation path Tejun Heo
2017-01-17  0:26   ` Joonsoo Kim
2017-01-17 16:42     ` Tejun Heo
2017-01-14 18:48 ` [PATCH 8/8] slab: remove slub sysfs interface files early for empty memcg caches Tejun Heo
2017-01-17  0:12 ` [PATCHSET v2] slab: make memcg slab destruction scalable Joonsoo Kim
2017-01-17 16:49   ` Tejun Heo
2017-01-18  7:54     ` Joonsoo Kim
2017-01-18 21:01       ` Tejun Heo
