* [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches
@ 2016-10-01 13:56 ` Vladimir Davydov
  0 siblings, 0 replies; 20+ messages in thread
From: Vladimir Davydov @ 2016-10-01 13:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Christoph Lameter, David Rientjes,
	Johannes Weiner, Joonsoo Kim, Michal Hocko, Pekka Enberg

Creating a lot of cgroups at the same time might stall all worker
threads with kmem cache creation works, because kmem cache creation is
done with the slab_mutex held. To prevent that from happening, let's use
a special workqueue for kmem cache creation with max in-flight work
items equal to 1.
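
[ Illustration, not part of the patch: a minimal sketch of the intended
  setup using the standard workqueue API. The queue name and init
  function here are placeholders; the real change is in the diff below,
  and "cw" refers to the per-request work item from mm/memcontrol.c. ]

	#include <linux/workqueue.h>

	static struct workqueue_struct *cache_create_wq;

	static int __init cache_create_wq_init(void)
	{
		/* flags = 0, max_active = 1: at most one work item in flight */
		cache_create_wq = alloc_workqueue("cache_create", 0, 1);
		return cache_create_wq ? 0 : -ENOMEM;
	}

	/* queue to the dedicated workqueue instead of schedule_work(),
	 * which would put the item on the shared system workqueue */
	queue_work(cache_create_wq, &cw->work);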

Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reported-by: Doug Smythies <dsmythies@telus.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
---
 mm/memcontrol.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4be518d4e68a..c1efe59e3a20 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2175,6 +2175,8 @@ struct memcg_kmem_cache_create_work {
 	struct work_struct work;
 };
 
+static struct workqueue_struct *memcg_kmem_cache_create_wq;
+
 static void memcg_kmem_cache_create_func(struct work_struct *w)
 {
 	struct memcg_kmem_cache_create_work *cw =
@@ -2206,7 +2208,7 @@ static void __memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 	cw->cachep = cachep;
 	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
 
-	schedule_work(&cw->work);
+	queue_work(memcg_kmem_cache_create_wq, &cw->work);
 }
 
 static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
@@ -5794,6 +5796,17 @@ static int __init mem_cgroup_init(void)
 {
 	int cpu, node;
 
+#ifndef CONFIG_SLOB
+	/*
+	 * Kmem cache creation is mostly done with the slab_mutex held,
+	 * so use a special workqueue to avoid stalling all worker
+	 * threads in case lots of cgroups are created simultaneously.
+	 */
+	memcg_kmem_cache_create_wq =
+		alloc_workqueue("memcg_kmem_cache_create", 0, 1);
+	BUG_ON(!memcg_kmem_cache_create_wq);
+#endif
+
 	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
 
 	for_each_possible_cpu(cpu)
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [PATCH 2/2] slub: move synchronize_sched out of slab_mutex on shrink
  2016-10-01 13:56 ` Vladimir Davydov
@ 2016-10-01 13:56   ` Vladimir Davydov
  -1 siblings, 0 replies; 20+ messages in thread
From: Vladimir Davydov @ 2016-10-01 13:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Christoph Lameter, David Rientjes,
	Johannes Weiner, Joonsoo Kim, Michal Hocko, Pekka Enberg

synchronize_sched() is a heavy operation and calling it per each cache
owned by a memory cgroup being destroyed may take quite some time. What
is worse, it's currently called under the slab_mutex, stalling all works
doing cache creation/destruction.

Actually, there isn't much point in calling synchronize_sched() for each
cache - it's enough to call it just once - after setting cpu_partial for
all caches and before shrinking them. This way, we can also move it out
of the slab_mutex, which we have to hold for iterating over the slab
cache list.
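
[ Illustration, not part of the patch: a sketch of the reworked
  memcg_deactivate_kmem_caches() flow that this change enables; it
  mirrors the mm/slab_common.c hunk below. ]

	mutex_lock(&slab_mutex);
	list_for_each_entry(s, &slab_caches, list) {
		c = is_root_cache(s) ? cache_from_memcg_idx(s, idx) : NULL;
		if (c) {
			c->cpu_partial = 0;	/* stop caching empty slabs (SLUB) */
			c->min_partial = 0;
		}
	}
	mutex_unlock(&slab_mutex);

	/* a single grace period for all caches, outside the slab_mutex */
	synchronize_sched();

	mutex_lock(&slab_mutex);
	/* ... __kmem_cache_shrink() each of the memcg's caches ... */
	mutex_unlock(&slab_mutex);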

Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reported-by: Doug Smythies <dsmythies@telus.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>
---
 mm/slab.c        |  4 ++--
 mm/slab.h        |  2 +-
 mm/slab_common.c | 27 +++++++++++++++++++++++++--
 mm/slob.c        |  2 +-
 mm/slub.c        | 19 ++-----------------
 5 files changed, 31 insertions(+), 23 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index b67271024135..fb9b0b06f6b9 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -2339,7 +2339,7 @@ out:
 	return nr_freed;
 }
 
-int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *cachep)
 {
 	int ret = 0;
 	int node;
@@ -2359,7 +2359,7 @@ int __kmem_cache_shrink(struct kmem_cache *cachep, bool deactivate)
 
 int __kmem_cache_shutdown(struct kmem_cache *cachep)
 {
-	return __kmem_cache_shrink(cachep, false);
+	return __kmem_cache_shrink(cachep);
 }
 
 void __kmem_cache_release(struct kmem_cache *cachep)
diff --git a/mm/slab.h b/mm/slab.h
index 9653f2e2591a..36382b24ba98 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -146,7 +146,7 @@ static inline unsigned long kmem_cache_flags(unsigned long object_size,
 
 int __kmem_cache_shutdown(struct kmem_cache *);
 void __kmem_cache_release(struct kmem_cache *);
-int __kmem_cache_shrink(struct kmem_cache *, bool);
+int __kmem_cache_shrink(struct kmem_cache *);
 void slab_kmem_cache_release(struct kmem_cache *);
 
 struct seq_file;
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 71f0b28a1bec..3fac1e2ca67b 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -573,6 +573,29 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 	get_online_cpus();
 	get_online_mems();
 
+#ifdef CONFIG_SLUB
+	/*
+	 * In case of SLUB, we need to disable empty slab caching to
+	 * avoid pinning the offline memory cgroup by freeable kmem
+	 * pages charged to it. SLAB doesn't need this, as it
+	 * periodically purges unused slabs.
+	 */
+	mutex_lock(&slab_mutex);
+	list_for_each_entry(s, &slab_caches, list) {
+		c = is_root_cache(s) ? cache_from_memcg_idx(s, idx) : NULL;
+		if (c) {
+			c->cpu_partial = 0;
+			c->min_partial = 0;
+		}
+	}
+	mutex_unlock(&slab_mutex);
+	/*
+	 * kmem_cache->cpu_partial is checked locklessly (see
+	 * put_cpu_partial()). Make sure the change is visible.
+	 */
+	synchronize_sched();
+#endif
+
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list) {
 		if (!is_root_cache(s))
@@ -584,7 +607,7 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *memcg)
 		if (!c)
 			continue;
 
-		__kmem_cache_shrink(c, true);
+		__kmem_cache_shrink(c);
 		arr->entries[idx] = NULL;
 	}
 	mutex_unlock(&slab_mutex);
@@ -755,7 +778,7 @@ int kmem_cache_shrink(struct kmem_cache *cachep)
 	get_online_cpus();
 	get_online_mems();
 	kasan_cache_shrink(cachep);
-	ret = __kmem_cache_shrink(cachep, false);
+	ret = __kmem_cache_shrink(cachep);
 	put_online_mems();
 	put_online_cpus();
 	return ret;
diff --git a/mm/slob.c b/mm/slob.c
index 5ec158054ffe..eac04d4357ec 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -634,7 +634,7 @@ void __kmem_cache_release(struct kmem_cache *c)
 {
 }
 
-int __kmem_cache_shrink(struct kmem_cache *d, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *d)
 {
 	return 0;
 }
diff --git a/mm/slub.c b/mm/slub.c
index 9adae58462f8..379b7963e48e 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3868,7 +3868,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
+int __kmem_cache_shrink(struct kmem_cache *s)
 {
 	int node;
 	int i;
@@ -3880,21 +3880,6 @@ int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
 	unsigned long flags;
 	int ret = 0;
 
-	if (deactivate) {
-		/*
-		 * Disable empty slabs caching. Used to avoid pinning offline
-		 * memory cgroups by kmem pages that can be freed.
-		 */
-		s->cpu_partial = 0;
-		s->min_partial = 0;
-
-		/*
-		 * s->cpu_partial is checked locklessly (see put_cpu_partial),
-		 * so we have to make sure the change is visible.
-		 */
-		synchronize_sched();
-	}
-
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
 		INIT_LIST_HEAD(&discard);
@@ -3951,7 +3936,7 @@ static int slab_mem_going_offline_callback(void *arg)
 
 	mutex_lock(&slab_mutex);
 	list_for_each_entry(s, &slab_caches, list)
-		__kmem_cache_shrink(s, false);
+		__kmem_cache_shrink(s);
 	mutex_unlock(&slab_mutex);
 
 	return 0;
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches
  2016-10-01 13:56 ` Vladimir Davydov
@ 2016-10-03 12:06   ` Michal Hocko
  -1 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-10-03 12:06 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, linux-mm, linux-kernel, Christoph Lameter,
	David Rientjes, Johannes Weiner, Joonsoo Kim, Pekka Enberg

On Sat 01-10-16 16:56:47, Vladimir Davydov wrote:
> Creating a lot of cgroups at the same time might stall all worker
> threads with kmem cache creation works, because kmem cache creation is
> done with the slab_mutex held. To prevent that from happening, let's use
> a special workqueue for kmem cache creation with max in-flight work
> items equal to 1.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981

This looks like a regression but I am not really sure I understand what
has caused it. We had the WQ based cache creation since kmem was
introduced more or less. So is it 801faf0db894 ("mm/slab: lockless
decision to grow cache") which was pointed by bisection that changed the
timing resp. relaxed the cache creation to the point that would allow
this runaway? This would be really useful for the stable backport
consideration.

Also, if I understand the fix correctly, now we do limit the number of
workers to 1 thread. Is this really what we want? Wouldn't it be
possible that few memcgs could starve others from having their cache
created? What would be the result, missed charges?

> Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
> Reported-by: Doug Smythies <dsmythies@telus.net>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Pekka Enberg <penberg@kernel.org>
> ---
>  mm/memcontrol.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4be518d4e68a..c1efe59e3a20 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2175,6 +2175,8 @@ struct memcg_kmem_cache_create_work {
>  	struct work_struct work;
>  };
>  
> +static struct workqueue_struct *memcg_kmem_cache_create_wq;
> +
>  static void memcg_kmem_cache_create_func(struct work_struct *w)
>  {
>  	struct memcg_kmem_cache_create_work *cw =
> @@ -2206,7 +2208,7 @@ static void __memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
>  	cw->cachep = cachep;
>  	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
>  
> -	schedule_work(&cw->work);
> +	queue_work(memcg_kmem_cache_create_wq, &cw->work);
>  }
>  
>  static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
> @@ -5794,6 +5796,17 @@ static int __init mem_cgroup_init(void)
>  {
>  	int cpu, node;
>  
> +#ifndef CONFIG_SLOB
> +	/*
> +	 * Kmem cache creation is mostly done with the slab_mutex held,
> +	 * so use a special workqueue to avoid stalling all worker
> +	 * threads in case lots of cgroups are created simultaneously.
> +	 */
> +	memcg_kmem_cache_create_wq =
> +		alloc_workqueue("memcg_kmem_cache_create", 0, 1);
> +	BUG_ON(!memcg_kmem_cache_create_wq);
> +#endif
> +
>  	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>  
>  	for_each_possible_cpu(cpu)
> -- 
> 2.1.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches
  2016-10-03 12:06   ` Michal Hocko
@ 2016-10-03 12:35     ` Vladimir Davydov
  -1 siblings, 0 replies; 20+ messages in thread
From: Vladimir Davydov @ 2016-10-03 12:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-mm, linux-kernel, Christoph Lameter,
	David Rientjes, Johannes Weiner, Joonsoo Kim, Pekka Enberg

On Mon, Oct 03, 2016 at 02:06:42PM +0200, Michal Hocko wrote:
> On Sat 01-10-16 16:56:47, Vladimir Davydov wrote:
> > Creating a lot of cgroups at the same time might stall all worker
> > threads with kmem cache creation works, because kmem cache creation is
> > done with the slab_mutex held. To prevent that from happening, let's use
> > a special workqueue for kmem cache creation with max in-flight work
> > items equal to 1.
> > 
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
> 
> This looks like a regression but I am not really sure I understand what
> has caused it. We had the WQ based cache creation since kmem was
> introduced more or less. So is it 801faf0db894 ("mm/slab: lockless
> decision to grow cache") which was pointed by bisection that changed the
> timing resp. relaxed the cache creation to the point that would allow
> this runaway?

It is in case of SLAB. For SLUB the issue was caused by commit
81ae6d03952c ("mm/slub.c: replace kick_all_cpus_sync() with
synchronize_sched() in kmem_cache_shrink()").

> This would be really useful for the stable backport
> consideration.
> 
> Also, if I understand the fix correctly, now we do limit the number of
> workers to 1 thread. Is this really what we want? Wouldn't it be
> possible that few memcgs could starve others fromm having their cache
> created? What would be the result, missed charges?

Now kmem caches are created in FIFO order, i.e. if one memcg called
kmem_cache_alloc on a non-existent cache before another, it will be
served first. Since the number of caches that can be created by a single
memcg is obviously limited, I don't see any possibility of starvation.
Actually, this patch doesn't introduce any functional changes regarding
the order in which kmem caches are created, as the work function holds
the global slab_mutex during its whole runtime anyway. We only avoid
creating a thread per each work by making the queue single-threaded.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches
  2016-10-03 12:35     ` Vladimir Davydov
@ 2016-10-03 13:19       ` Michal Hocko
  -1 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-10-03 13:19 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, linux-mm, linux-kernel, Christoph Lameter,
	David Rientjes, Johannes Weiner, Joonsoo Kim, Pekka Enberg

On Mon 03-10-16 15:35:06, Vladimir Davydov wrote:
> On Mon, Oct 03, 2016 at 02:06:42PM +0200, Michal Hocko wrote:
> > On Sat 01-10-16 16:56:47, Vladimir Davydov wrote:
> > > Creating a lot of cgroups at the same time might stall all worker
> > > threads with kmem cache creation works, because kmem cache creation is
> > > done with the slab_mutex held. To prevent that from happening, let's use
> > > a special workqueue for kmem cache creation with max in-flight work
> > > items equal to 1.
> > > 
> > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
> > 
> > This looks like a regression but I am not really sure I understand what
> > has caused it. We had the WQ based cache creation since kmem was
> > introduced more or less. So is it 801faf0db894 ("mm/slab: lockless
> > decision to grow cache") which was pointed by bisection that changed the
> > timing resp. relaxed the cache creation to the point that would allow
> > this runaway?
> 
> It is in case of SLAB. For SLUB the issue was caused by commit
> 81ae6d03952c ("mm/slub.c: replace kick_all_cpus_sync() with
> synchronize_sched() in kmem_cache_shrink()").

OK, thanks for the confirmation. This would be useful in the changelog
imho.

> > This would be really useful for the stable backport
> > consideration.
> > 
> > Also, if I understand the fix correctly, now we do limit the number of
> > workers to 1 thread. Is this really what we want? Wouldn't it be
> > possible that few memcgs could starve others from having their cache
> > created? What would be the result, missed charges?
> 
> Now kmem caches are created in FIFO order, i.e. if one memcg called
> kmem_cache_alloc on a non-existent cache before another, it will be
> served first.

I do not see where this FIFO is guaranteed.
__memcg_schedule_kmem_cache_create doesn't seem to be using ordered WQ.

> Since the number of caches that can be created by a single
> memcg is obviously limited,

by the number of existing caches, right?

> I don't see any possibility of starvation.

What I meant was that, while workers can currently contend on the
slab_mutex, with the patch there will be a real ordering in place AFAIU,
and so an unlucky memcg can be waiting for N(memcgs) * N(caches) works
to be served. Not that the current implementation gives us anything
better, because the ordering is more or less scheduling and worker
dependent. Or I am missing something. A per-cache memcg WQ would
mitigate this to some extent.

> Actually, this patch doesn't introduce any functional changes regarding
> the order in which kmem caches are created, as the work function holds
> the global slab_mutex during its whole runtime anyway. We only avoid
> creating a thread per each work by making the queue single-threaded.

OK please put this information into the changelog.

That being said, I am not opposing the current solution; I just wanted
to understand all the consequences and would appreciate more information
in the changelog, as this seems like stable material.

Thanks!

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches
  2016-10-03 13:19       ` Michal Hocko
@ 2016-10-04 13:14         ` Vladimir Davydov
  -1 siblings, 0 replies; 20+ messages in thread
From: Vladimir Davydov @ 2016-10-04 13:14 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton; +Cc: linux-mm, linux-kernel

On Mon, Oct 03, 2016 at 03:19:31PM +0200, Michal Hocko wrote:
> On Mon 03-10-16 15:35:06, Vladimir Davydov wrote:
> > On Mon, Oct 03, 2016 at 02:06:42PM +0200, Michal Hocko wrote:
> > > On Sat 01-10-16 16:56:47, Vladimir Davydov wrote:
> > > > Creating a lot of cgroups at the same time might stall all worker
> > > > threads with kmem cache creation works, because kmem cache creation is
> > > > done with the slab_mutex held. To prevent that from happening, let's use
> > > > a special workqueue for kmem cache creation with max in-flight work
> > > > items equal to 1.
> > > > 
> > > > Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
> > > 
> > > This looks like a regression but I am not really sure I understand what
> > > has caused it. We had the WQ based cache creation since kmem was
> > > introduced more or less. So is it 801faf0db894 ("mm/slab: lockless
> > > decision to grow cache") which was pointed by bisection that changed the
> > > timing resp. relaxed the cache creation to the point that would allow
> > > this runaway?
> > 
> > It is in case of SLAB. For SLUB the issue was caused by commit
> > 81ae6d03952c ("mm/slub.c: replace kick_all_cpus_sync() with
> > synchronize_sched() in kmem_cache_shrink()").
> 
> OK, thanks for the confirmation. This would be useful in the changelog
> imho.
> 
> > > This would be really useful for the stable backport
> > > consideration.
> > > 
> > > Also, if I understand the fix correctly, now we do limit the number of
> > > workers to 1 thread. Is this really what we want? Wouldn't it be
> > > possible that few memcgs could starve others from having their cache
> > > created? What would be the result, missed charges?
> > 
> > Now kmem caches are created in FIFO order, i.e. if one memcg called
> > kmem_cache_alloc on a non-existent cache before another, it will be
> > served first.
> 
> I do not see where this FIFO is guaranteed.
> __memcg_schedule_kmem_cache_create doesn't seem to be using ordered WQ.

Yeah, you're right - I thought max_active implies ordering, but it
doesn't. Then we can use an ordered workqueue. Here's the updated
patch:

From 10f5f126800912c6a4b78a8b615138c1322694ad Mon Sep 17 00:00:00 2001
From: Vladimir Davydov <vdavydov.dev@gmail.com>
Date: Sat, 1 Oct 2016 16:39:09 +0300
Subject: [PATCH] mm: memcontrol: use special workqueue for creating per-memcg
 caches

Creating a lot of cgroups at the same time might stall all worker
threads with kmem cache creation works, because kmem cache creation is
done with the slab_mutex held. The problem was amplified by commits
801faf0db894 ("mm/slab: lockless decision to grow cache") in case of
SLAB and 81ae6d03952c ("mm/slub.c: replace kick_all_cpus_sync() with
synchronize_sched() in kmem_cache_shrink()") in case of SLUB, which
increased the maximal time the slab_mutex can be held.

To prevent that from happening, let's use a special ordered single
threaded workqueue for kmem cache creation. This shouldn't introduce any
functional changes regarding how kmem caches are created, as the work
function holds the global slab_mutex during its whole runtime anyway,
making it impossible to run more than one work at a time. By using a
single threaded workqueue, we just avoid creating a thread per each
work. Ordering is required to avoid a situation when a cgroup's work is
put off indefinitely because there are other cgroups to serve, in other
words to guarantee fairness.
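
[ Illustration, not part of the patch: the two allocation calls discussed
  in this thread. max_active only bounds the number of in-flight work
  items, whereas an ordered workqueue additionally executes them in
  queueing order. ]

	struct workqueue_struct *wq;

	/* v1: at most one work item in flight, but no ordering guarantee */
	wq = alloc_workqueue("memcg_kmem_cache_create", 0, 1);

	/* v2: one item in flight and strict FIFO execution order */
	wq = alloc_ordered_workqueue("memcg_kmem_cache_create", 0);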

Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
Reported-by: Doug Smythies <dsmythies@telus.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Pekka Enberg <penberg@kernel.org>

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4be518d4e68a..8d753d87ca37 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2175,6 +2175,8 @@ struct memcg_kmem_cache_create_work {
 	struct work_struct work;
 };
 
+static struct workqueue_struct *memcg_kmem_cache_create_wq;
+
 static void memcg_kmem_cache_create_func(struct work_struct *w)
 {
 	struct memcg_kmem_cache_create_work *cw =
@@ -2206,7 +2208,7 @@ static void __memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
 	cw->cachep = cachep;
 	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
 
-	schedule_work(&cw->work);
+	queue_work(memcg_kmem_cache_create_wq, &cw->work);
 }
 
 static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
@@ -5794,6 +5796,17 @@ static int __init mem_cgroup_init(void)
 {
 	int cpu, node;
 
+#ifndef CONFIG_SLOB
+	/*
+	 * Kmem cache creation is mostly done with the slab_mutex held,
+	 * so use a special workqueue to avoid stalling all worker
+	 * threads in case lots of cgroups are created simultaneously.
+	 */
+	memcg_kmem_cache_create_wq =
+		alloc_ordered_workqueue("memcg_kmem_cache_create", 0);
+	BUG_ON(!memcg_kmem_cache_create_wq);
+#endif
+
 	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
 
 	for_each_possible_cpu(cpu)

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 2/2] slub: move synchronize_sched out of slab_mutex on shrink
  2016-10-01 13:56   ` Vladimir Davydov
@ 2016-10-06  6:27     ` Joonsoo Kim
  -1 siblings, 0 replies; 20+ messages in thread
From: Joonsoo Kim @ 2016-10-06  6:27 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, linux-mm, linux-kernel, Christoph Lameter,
	David Rientjes, Johannes Weiner, Michal Hocko, Pekka Enberg,
	Doug Smythies

Cc'ing Doug, the original reporter.

On Sat, Oct 01, 2016 at 04:56:48PM +0300, Vladimir Davydov wrote:
> synchronize_sched() is a heavy operation and calling it per each cache
> owned by a memory cgroup being destroyed may take quite some time. What
> is worse, it's currently called under the slab_mutex, stalling all works
> doing cache creation/destruction.
> 
> Actually, there isn't much point in calling synchronize_sched() for each
> cache - it's enough to call it just once - after setting cpu_partial for
> all caches and before shrinking them. This way, we can also move it out
> of the slab_mutex, which we have to hold for iterating over the slab
> cache list.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=172991
> Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
> Reported-by: Doug Smythies <dsmythies@telus.net>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Pekka Enberg <penberg@kernel.org>

Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>

These two patches should be sent to stable, shouldn't they?

Thanks.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches
  2016-10-04 13:14         ` Vladimir Davydov
@ 2016-10-06 12:05           ` Michal Hocko
  -1 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-10-06 12:05 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Andrew Morton, linux-mm, linux-kernel

On Tue 04-10-16 16:14:17, Vladimir Davydov wrote:
[...]
> >From 10f5f126800912c6a4b78a8b615138c1322694ad Mon Sep 17 00:00:00 2001
> From: Vladimir Davydov <vdavydov.dev@gmail.com>
> Date: Sat, 1 Oct 2016 16:39:09 +0300
> Subject: [PATCH] mm: memcontrol: use special workqueue for creating per-memcg
>  caches
> 
> Creating a lot of cgroups at the same time might stall all worker
> threads with kmem cache creation works, because kmem cache creation is
> done with the slab_mutex held. The problem was amplified by commits
> 801faf0db894 ("mm/slab: lockless decision to grow cache") in case of
> SLAB and 81ae6d03952c ("mm/slub.c: replace kick_all_cpus_sync() with
> synchronize_sched() in kmem_cache_shrink()") in case of SLUB, which
> increased the maximal time the slab_mutex can be held.
> 
> To prevent that from happening, let's use a special ordered single
> threaded workqueue for kmem cache creation. This shouldn't introduce any
> functional changes regarding how kmem caches are created, as the work
> function holds the global slab_mutex during its whole runtime anyway,
> making it impossible to run more than one work at a time. By using a
> single threaded workqueue, we just avoid creating a thread per each
> work. Ordering is required to avoid a situation when a cgroup's work is
> put off indefinitely because there are other cgroups to serve, in other
> words to guarantee fairness.

I am not sure indefinite starvation was possible, but the fairness
concern seems to be real AFAICS.

> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=172981
> Signed-off-by: Vladimir Davydov <vdavydov.dev@gmail.com>
> Reported-by: Doug Smythies <dsmythies@telus.net>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Pekka Enberg <penberg@kernel.org>

Acked-by: Michal Hocko <mhocko@suse.com>

> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4be518d4e68a..8d753d87ca37 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2175,6 +2175,8 @@ struct memcg_kmem_cache_create_work {
>  	struct work_struct work;
>  };
>  
> +static struct workqueue_struct *memcg_kmem_cache_create_wq;
> +
>  static void memcg_kmem_cache_create_func(struct work_struct *w)
>  {
>  	struct memcg_kmem_cache_create_work *cw =
> @@ -2206,7 +2208,7 @@ static void __memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
>  	cw->cachep = cachep;
>  	INIT_WORK(&cw->work, memcg_kmem_cache_create_func);
>  
> -	schedule_work(&cw->work);
> +	queue_work(memcg_kmem_cache_create_wq, &cw->work);
>  }
>  
>  static void memcg_schedule_kmem_cache_create(struct mem_cgroup *memcg,
> @@ -5794,6 +5796,17 @@ static int __init mem_cgroup_init(void)
>  {
>  	int cpu, node;
>  
> +#ifndef CONFIG_SLOB
> +	/*
> +	 * Kmem cache creation is mostly done with the slab_mutex held,
> +	 * so use a special workqueue to avoid stalling all worker
> +	 * threads in case lots of cgroups are created simultaneously.
> +	 */
> +	memcg_kmem_cache_create_wq =
> +		alloc_ordered_workqueue("memcg_kmem_cache_create", 0);
> +	BUG_ON(!memcg_kmem_cache_create_wq);
> +#endif
> +
>  	hotcpu_notifier(memcg_cpu_hotplug_callback, 0);
>  
>  	for_each_possible_cpu(cpu)

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches
  2016-10-04 13:14         ` Vladimir Davydov
@ 2016-10-21  3:44           ` Andrew Morton
  -1 siblings, 0 replies; 20+ messages in thread
From: Andrew Morton @ 2016-10-21  3:44 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: Michal Hocko, linux-mm, linux-kernel

On Tue, 4 Oct 2016 16:14:17 +0300 Vladimir Davydov <vdavydov.dev@gmail.com> wrote:

> Creating a lot of cgroups at the same time might stall all worker
> threads with kmem cache creation work items, because kmem cache
> creation is done with the slab_mutex held. The problem was amplified
> by commits 801faf0db894 ("mm/slab: lockless decision to grow cache")
> in case of SLAB and 81ae6d03952c ("mm/slub.c: replace
> kick_all_cpus_sync() with synchronize_sched() in kmem_cache_shrink()")
> in case of SLUB, which increased the maximal time the slab_mutex can
> be held.
> 
> To prevent that from happening, let's use a special ordered
> single-threaded workqueue for kmem cache creation. This shouldn't
> introduce any functional changes in how kmem caches are created, as
> the work function holds the global slab_mutex for its whole runtime
> anyway, making it impossible to run more than one work item at a
> time. By using a single-threaded workqueue, we just avoid creating a
> thread for each work item. Ordering is required to avoid a situation
> where a cgroup's work is put off indefinitely because there are other
> cgroups to serve; in other words, to guarantee fairness.
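
A rough sketch of the serialization point the quoted changelog describes; it is illustrative only, not the actual memcg work function, and slab_mutex is the only real name in it:

#include <linux/mutex.h>
#include <linux/workqueue.h>

static DEFINE_MUTEX(example_mutex);	/* stand-in for slab_mutex */

static void example_cache_create_func(struct work_struct *w)
{
	/*
	 * Every queued work item blocks here until the previous one
	 * finishes.  On the default workqueue (schedule_work) the pool
	 * sees its worker go to sleep and spawns another kworker to
	 * keep making progress, so thousands of pending items mean
	 * thousands of kernel threads.  An ordered workqueue never has
	 * more than one item in flight, so the pile-up cannot happen.
	 */
	mutex_lock(&example_mutex);
	/* ... long-running kmem cache creation ... */
	mutex_unlock(&example_mutex);
}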

I'm having trouble working out the urgency of this patch?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches
  2016-10-21  3:44           ` Andrew Morton
@ 2016-10-21  6:39             ` Michal Hocko
  -1 siblings, 0 replies; 20+ messages in thread
From: Michal Hocko @ 2016-10-21  6:39 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Vladimir Davydov, linux-mm, linux-kernel

On Thu 20-10-16 20:44:35, Andrew Morton wrote:
> On Tue, 4 Oct 2016 16:14:17 +0300 Vladimir Davydov <vdavydov.dev@gmail.com> wrote:
> 
> > Creating a lot of cgroups at the same time might stall all worker
> > threads with kmem cache creation work items, because kmem cache
> > creation is done with the slab_mutex held. The problem was amplified
> > by commits 801faf0db894 ("mm/slab: lockless decision to grow cache")
> > in case of SLAB and 81ae6d03952c ("mm/slub.c: replace
> > kick_all_cpus_sync() with synchronize_sched() in kmem_cache_shrink()")
> > in case of SLUB, which increased the maximal time the slab_mutex can
> > be held.
> > 
> > To prevent that from happening, let's use a special ordered
> > single-threaded workqueue for kmem cache creation. This shouldn't
> > introduce any functional changes in how kmem caches are created, as
> > the work function holds the global slab_mutex for its whole runtime
> > anyway, making it impossible to run more than one work item at a
> > time. By using a single-threaded workqueue, we just avoid creating a
> > thread for each work item. Ordering is required to avoid a situation
> > where a cgroup's work is put off indefinitely because there are other
> > cgroups to serve; in other words, to guarantee fairness.
> 
> I'm having trouble working out the urgency of this patch?

Seeing thousands of kernel threads is certainly annoying, so I think we
want to merge it sooner rather than later and have it backported to
stable as well.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2016-10-21  6:39 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-01 13:56 [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches Vladimir Davydov
2016-10-01 13:56 ` Vladimir Davydov
2016-10-01 13:56 ` [PATCH 2/2] slub: move synchronize_sched out of slab_mutex on shrink Vladimir Davydov
2016-10-01 13:56   ` Vladimir Davydov
2016-10-06  6:27   ` Joonsoo Kim
2016-10-06  6:27     ` Joonsoo Kim
2016-10-03 12:06 ` [PATCH 1/2] mm: memcontrol: use special workqueue for creating per-memcg caches Michal Hocko
2016-10-03 12:06   ` Michal Hocko
2016-10-03 12:35   ` Vladimir Davydov
2016-10-03 12:35     ` Vladimir Davydov
2016-10-03 13:19     ` Michal Hocko
2016-10-03 13:19       ` Michal Hocko
2016-10-04 13:14       ` Vladimir Davydov
2016-10-04 13:14         ` Vladimir Davydov
2016-10-06 12:05         ` Michal Hocko
2016-10-06 12:05           ` Michal Hocko
2016-10-21  3:44         ` Andrew Morton
2016-10-21  3:44           ` Andrew Morton
2016-10-21  6:39           ` Michal Hocko
2016-10-21  6:39             ` Michal Hocko
