* [PATCH RESEND -mm 00/12] kmemcg reparenting
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

[rebased on top of v3.14-rc6-mmotm-2014-03-12-16-04]

Hi,

During my recent attempt to push kmemcg shrinkers, it was pointed out to
me that the current kmemcg implementation has a serious design flaw - it
lacks reparenting. Currently each memcg cache holds a css reference to
its memcg and does not let it go until the cache is emptied. Although
this approach is simple, it leads to dead memcgs hanging around for
quite a long time, which is ugly, and building anything on top of that
is unacceptable. So this patch set aims to implement reparenting of
kmemcg charges.

[ for more details see the discussion thread:
  https://lkml.org/lkml/2014/2/11/623 ]
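
For reference, the pinning works roughly like this - a rough sketch with
abridged, illustrative helper names; only css_get/css_put on the memcg
css and the memcg back pointer correspond to the real code:

  static void memcg_cache_created(struct memcg_cache_params *params,
                                  struct mem_cgroup *memcg)
  {
          params->memcg = memcg;
          css_get(&memcg->css);   /* pin the memcg for the cache's lifetime */
  }

  static void memcg_cache_destroyed(struct memcg_cache_params *params)
  {
          /*
           * The reference is dropped only once the cache is finally
           * emptied and destroyed, so a dead memcg can linger long
           * after its last task has gone.
           */
          css_put(&params->memcg->css);
  }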

It is based on top of 3.14-rc6-mmotm and organized as follows:
 - Patches 1-3 fix some nasty races in the kmemcg implementation. I
   could not leave them unfixed any longer, because they touch the code
   I'm going to modify.
 - Patches 4-6 prepare memcg_cache_params for reparenting.
 - Patch 7 reworks slab charging, making it easier to track and
   therefore reparent kmem charges; patches 8-10 kill the old charging
   code.
 - Patch 11 introduces kmemcg reparenting.
 - Patch 12 is for slub. It fixes sysfs naming clashes that can arise
   due to reparented caches.

Please note that this patch set does not resolve all kmemcg-related
issues - there are still plenty of them (e.g. "dangling" caches), but it
is already big enough, so I guess I'll address the rest later, once this
one is committed (if it is committed at all, of course).

Many thanks to Johannes Weiner, who proposed the idea and kindly
outlined basic design principles.

Thanks,

Vladimir Davydov (12):
  memcg: flush cache creation works before memcg cache destruction
  memcg: fix race in memcg cache destruction path
  memcg: fix root vs memcg cache destruction race
  memcg: move slab caches list/mutex init to memcg creation
  memcg: add pointer from memcg_cache_params to cache
  memcg: keep all children of each root cache on a list
  memcg: rework slab charging
  memcg: do not charge kmalloc_large allocations
  fork: do not charge thread_info to kmemcg
  memcg: kill GFP_KMEMCG and stuff
  memcg: reparent slab on css offline
  slub: make sure all memcg caches have unique names on sysfs

 include/linux/gfp.h             |    5 -
 include/linux/memcontrol.h      |  133 ++-------
 include/linux/slab.h            |   15 +-
 include/linux/thread_info.h     |    2 -
 include/trace/events/gfpflags.h |    1 -
 kernel/fork.c                   |    4 +-
 mm/memcontrol.c                 |  579 +++++++++++++++++----------------------
 mm/page_alloc.c                 |   35 ---
 mm/slab.c                       |   47 ++--
 mm/slab.h                       |   17 +-
 mm/slab_common.c                |  100 +++++--
 mm/slub.c                       |   85 ++++--
 12 files changed, 469 insertions(+), 554 deletions(-)

-- 
1.7.10.4


* [PATCH RESEND -mm 01/12] memcg: flush cache creation works before memcg cache destruction
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

When we get to memcg cache destruction, either from the root cache
destruction path or when turning a memcg offline, there might still be
pending memcg cache creation works that were scheduled before we
initiated the destruction. We need to flush them before starting to
destroy memcg caches, otherwise we can end up with a leaked kmem cache
or, even worse, a use-after-free.
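
The fix boils down to the standard flush-before-teardown workqueue
pattern. A minimal sketch with made-up names (the real patch queues the
creation works on a dedicated "memcg_cache_create" workqueue, see the
diff below):

  #include <linux/workqueue.h>
  #include <linux/errno.h>
  #include <linux/init.h>

  static struct workqueue_struct *create_wq;

  static int __init example_init(void)
  {
          /* dedicated workqueue; max_active limited to 1 as in the patch */
          create_wq = alloc_workqueue("example_create", 0, 1);
          return create_wq ? 0 : -ENOMEM;
  }

  static void enqueue_creation(struct work_struct *work)
  {
          queue_work(create_wq, work);    /* may still be pending at teardown */
  }

  static void teardown_caches(void)
  {
          /*
           * No new creation works can be queued at this point; wait for
           * the ones already submitted so they do not touch caches we
           * are about to destroy.
           */
          flush_workqueue(create_wq);
          /* ... now it is safe to destroy the caches ... */
  }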

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 mm/memcontrol.c |   32 +++++++++++++++++++++++++++++++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9d489a9e7701..b183aaf1b616 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2904,6 +2904,7 @@ static DEFINE_MUTEX(set_limit_mutex);
 
 #ifdef CONFIG_MEMCG_KMEM
 static DEFINE_MUTEX(activate_kmem_mutex);
+static struct workqueue_struct *memcg_cache_create_wq;
 
 static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 {
@@ -3327,6 +3328,15 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	int i, failed = 0;
 
 	/*
+	 * Since the cache is being destroyed, it shouldn't be allocated from
+	 * any more, and therefore no new memcg cache creation works could be
+	 * scheduled. However, there still might be pending works scheduled
+	 * before the cache destruction was initiated. Flush them before
+	 * destroying child caches to avoid nasty races.
+	 */
+	flush_workqueue(memcg_cache_create_wq);
+
+	/*
 	 * If the cache is being destroyed, we trust that there is no one else
 	 * requesting objects from it. Even if there are, the sanity checks in
 	 * kmem_cache_destroy should caught this ill-case.
@@ -3374,6 +3384,15 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 	if (!memcg_kmem_is_active(memcg))
 		return;
 
+	/*
+	 * By the time we get here, the cgroup must be empty. That said no new
+	 * allocations can happen from its caches, and therefore no new memcg
+	 * cache creation works can be scheduled. However, there still might be
+	 * pending works scheduled before the cgroup was turned offline. Flush
+	 * them before destroying memcg caches to avoid nasty races.
+	 */
+	flush_workqueue(memcg_cache_create_wq);
+
 	mutex_lock(&memcg->slab_caches_mutex);
 	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
 		cachep = memcg_params_to_cache(params);
@@ -3418,7 +3437,7 @@ static void __memcg_create_cache_enqueue(struct mem_cgroup *memcg,
 	cw->cachep = cachep;
 
 	INIT_WORK(&cw->work, memcg_create_cache_work_func);
-	schedule_work(&cw->work);
+	queue_work(memcg_cache_create_wq, &cw->work);
 }
 
 static void memcg_create_cache_enqueue(struct mem_cgroup *memcg,
@@ -3621,10 +3640,20 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
 	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
 	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
 }
+
+static void __init memcg_kmem_init(void)
+{
+	memcg_cache_create_wq = alloc_workqueue("memcg_cache_create", 0, 1);
+	BUG_ON(!memcg_cache_create_wq);
+}
 #else
 static inline void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 {
 }
+
+static void __init memcg_kmem_init(void)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -7181,6 +7210,7 @@ static int __init mem_cgroup_init(void)
 	enable_swap_cgroup();
 	mem_cgroup_soft_limit_tree_init();
 	memcg_stock_init();
+	memcg_kmem_init();
 	return 0;
 }
 subsys_initcall(mem_cgroup_init);
-- 
1.7.10.4


* [PATCH RESEND -mm 02/12] memcg: fix race in memcg cache destruction path
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

We schedule the memcg cache shrink+destruction work
(memcg_params::destroy) from two places: when we turn a memcg offline
(mem_cgroup_destroy_all_caches) and when the last page of the cache is
freed (memcg_params::nr_pages reaches zero, see memcg_release_pages,
mem_cgroup_destroy_cache). Since the latter can happen while the work
scheduled from mem_cgroup_destroy_all_caches is in progress or still
pending, we need to be cautious to avoid races there - we should bail
out in one of those functions if we see that the other is in progress.
Currently, the only check we have is in the destruction work handler: we
do not destroy the cache unless memcg_params::nr_pages is zero. But that
is not enough. An example of a race we can get is shown below:

  CPU0					CPU1
  ----					----
  kmem_cache_destroy_work_func:		memcg_release_pages:
					  atomic_sub_and_test(1<<order, &s->
							memcg_params->nr_pages)
					  /* reached 0 => schedule destroy */

    atomic_read(&cachep->memcg_params->nr_pages)
    /* 0 => going to destroy the cache */
    kmem_cache_destroy(cachep);

					  mem_cgroup_destroy_cache(s):
					    /* the cache was destroyed on CPU0
					       - use after free */

An obvious way to fix this is to replace the nr_pages counter with a
reference counter and have the memcg hold a reference. Cache destruction
is then scheduled by whichever thread decrements the refcount to zero.
That is essentially what this patch does, but there is one subtle thing
here - the work handler serves not only for cache destruction, it also
shrinks the cache if it is still in use (we cannot call shrink directly
from mem_cgroup_destroy_all_caches due to locking dependencies). We
handle this by noting that we should only issue a shrink if the work was
scheduled from mem_cgroup_destroy_all_caches, because the cache is
already empty when we release its last page. And since the work handler
drops the reference taken by the memcg, it can detect who exactly
scheduled it - mem_cgroup_destroy_all_caches or memcg_release_pages.
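
A condensed sketch of the resulting scheme; the function names below are
illustrative, while memcg_params_to_cache(), the refcount field and the
destroy work item are the real ones (see the diff for the actual code):

  static void cache_pages_added(struct memcg_cache_params *p, int order)
  {
          atomic_add(1 << order, &p->refcount);
  }

  static void cache_pages_released(struct memcg_cache_params *p, int order)
  {
          /* the last put schedules destruction from a workqueue */
          if (atomic_sub_and_test(1 << order, &p->refcount))
                  schedule_work(&p->destroy);
  }

  static void destroy_work_fn(struct work_struct *w)
  {
          struct memcg_cache_params *p =
                  container_of(w, struct memcg_cache_params, destroy);

          if (atomic_read(&p->refcount) != 0) {
                  /*
                   * Scheduled by the memcg offline path: shrink first,
                   * then drop the reference held by the memcg.
                   */
                  kmem_cache_shrink(memcg_params_to_cache(p));
                  if (!atomic_dec_and_test(&p->refcount))
                          return;         /* pages still around */
          }
          kmem_cache_destroy(memcg_params_to_cache(p));
  }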

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/memcontrol.h |    1 -
 include/linux/slab.h       |    7 ++--
 mm/memcontrol.c            |   86 +++++++++++++-------------------------------
 mm/slab.h                  |   17 +++++++--
 4 files changed, 42 insertions(+), 69 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e9dfcdad24c5..bbe48913c56e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -512,7 +512,6 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
-void mem_cgroup_destroy_cache(struct kmem_cache *cachep);
 int __kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 3dd389aa91c7..3ed53de256ea 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -522,8 +522,8 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * @memcg: pointer to the memcg this cache belongs to
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
- * @dead: set to true after the memcg dies; the cache may still be around.
- * @nr_pages: number of pages that belongs to this cache.
+ * @refcount: the reference counter; cache destruction will be scheduled when
+ *            it reaches zero
  * @destroy: worker to be called whenever we are ready, or believe we may be
  *           ready, to destroy this cache.
  */
@@ -538,8 +538,7 @@ struct memcg_cache_params {
 			struct mem_cgroup *memcg;
 			struct list_head list;
 			struct kmem_cache *root_cache;
-			bool dead;
-			atomic_t nr_pages;
+			atomic_t refcount;
 			struct work_struct destroy;
 		};
 	};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b183aaf1b616..cf32254ae1ee 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3141,6 +3141,7 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 		s->memcg_params->root_cache = root_cache;
 		INIT_WORK(&s->memcg_params->destroy,
 				kmem_cache_destroy_work_func);
+		atomic_set(&s->memcg_params->refcount, 1);
 		css_get(&memcg->css);
 	} else
 		s->memcg_params->is_root_cache = true;
@@ -3262,64 +3263,24 @@ static inline void memcg_resume_kmem_account(void)
 static void kmem_cache_destroy_work_func(struct work_struct *w)
 {
 	struct kmem_cache *cachep;
-	struct memcg_cache_params *p;
-
-	p = container_of(w, struct memcg_cache_params, destroy);
+	struct memcg_cache_params *params;
 
-	cachep = memcg_params_to_cache(p);
+	params = container_of(w, struct memcg_cache_params, destroy);
+	cachep = memcg_params_to_cache(params);
 
-	/*
-	 * If we get down to 0 after shrink, we could delete right away.
-	 * However, memcg_release_pages() already puts us back in the workqueue
-	 * in that case. If we proceed deleting, we'll get a dangling
-	 * reference, and removing the object from the workqueue in that case
-	 * is unnecessary complication. We are not a fast path.
-	 *
-	 * Note that this case is fundamentally different from racing with
-	 * shrink_slab(): if memcg_cgroup_destroy_cache() is called in
-	 * kmem_cache_shrink, not only we would be reinserting a dead cache
-	 * into the queue, but doing so from inside the worker racing to
-	 * destroy it.
-	 *
-	 * So if we aren't down to zero, we'll just schedule a worker and try
-	 * again
-	 */
-	if (atomic_read(&cachep->memcg_params->nr_pages) != 0)
+	if (atomic_read(&params->refcount) != 0) {
+		/*
+		 * We were scheduled from mem_cgroup_destroy_all_caches().
+		 * Shrink the cache and drop the reference taken by memcg.
+		 */
 		kmem_cache_shrink(cachep);
-	else
-		kmem_cache_destroy(cachep);
-}
 
-void mem_cgroup_destroy_cache(struct kmem_cache *cachep)
-{
-	if (!cachep->memcg_params->dead)
-		return;
+		/* cache is still in use? */
+		if (!atomic_dec_and_test(&params->refcount))
+			return;
+	}
 
-	/*
-	 * There are many ways in which we can get here.
-	 *
-	 * We can get to a memory-pressure situation while the delayed work is
-	 * still pending to run. The vmscan shrinkers can then release all
-	 * cache memory and get us to destruction. If this is the case, we'll
-	 * be executed twice, which is a bug (the second time will execute over
-	 * bogus data). In this case, cancelling the work should be fine.
-	 *
-	 * But we can also get here from the worker itself, if
-	 * kmem_cache_shrink is enough to shake all the remaining objects and
-	 * get the page count to 0. In this case, we'll deadlock if we try to
-	 * cancel the work (the worker runs with an internal lock held, which
-	 * is the same lock we would hold for cancel_work_sync().)
-	 *
-	 * Since we can't possibly know who got us here, just refrain from
-	 * running if there is already work pending
-	 */
-	if (work_pending(&cachep->memcg_params->destroy))
-		return;
-	/*
-	 * We have to defer the actual destroying to a workqueue, because
-	 * we might currently be in a context that cannot sleep.
-	 */
-	schedule_work(&cachep->memcg_params->destroy);
+	kmem_cache_destroy(cachep);
 }
 
 int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
@@ -3360,12 +3321,12 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 		 * kmem_cache_destroy() will call kmem_cache_shrink internally,
 		 * and that could spawn the workers again: it is likely that
 		 * the cache still have active pages until this very moment.
-		 * This would lead us back to mem_cgroup_destroy_cache.
+		 * This would lead us back to memcg_release_pages().
 		 *
-		 * But that will not execute at all if the "dead" flag is not
-		 * set, so flip it down to guarantee we are in control.
+		 * But that will not execute at all if the refcount is > 0, so
+		 * increment it to guarantee we are in control.
 		 */
-		c->memcg_params->dead = false;
+		atomic_inc(&c->memcg_params->refcount);
 		cancel_work_sync(&c->memcg_params->destroy);
 		kmem_cache_destroy(c);
 
@@ -3378,7 +3339,6 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 
 static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 {
-	struct kmem_cache *cachep;
 	struct memcg_cache_params *params;
 
 	if (!memcg_kmem_is_active(memcg))
@@ -3395,9 +3355,13 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 
 	mutex_lock(&memcg->slab_caches_mutex);
 	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
-		cachep = memcg_params_to_cache(params);
-		cachep->memcg_params->dead = true;
-		schedule_work(&cachep->memcg_params->destroy);
+		/*
+		 * Since we still hold the reference to the cache params from
+		 * the memcg, the work could not have been scheduled from
+		 * memcg_release_pages(), and this cannot fail.
+		 */
+		if (!schedule_work(&params->destroy))
+			BUG();
 	}
 	mutex_unlock(&memcg->slab_caches_mutex);
 }
diff --git a/mm/slab.h b/mm/slab.h
index 3045316b7c9d..b8caee243b88 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -122,7 +122,7 @@ static inline bool is_root_cache(struct kmem_cache *s)
 static inline void memcg_bind_pages(struct kmem_cache *s, int order)
 {
 	if (!is_root_cache(s))
-		atomic_add(1 << order, &s->memcg_params->nr_pages);
+		atomic_add(1 << order, &s->memcg_params->refcount);
 }
 
 static inline void memcg_release_pages(struct kmem_cache *s, int order)
@@ -130,8 +130,19 @@ static inline void memcg_release_pages(struct kmem_cache *s, int order)
 	if (is_root_cache(s))
 		return;
 
-	if (atomic_sub_and_test((1 << order), &s->memcg_params->nr_pages))
-		mem_cgroup_destroy_cache(s);
+	if (atomic_sub_and_test((1 << order), &s->memcg_params->refcount)) {
+		/*
+		 * We have to defer the actual destroying to a workqueue,
+		 * because we might currently be in a context that cannot
+		 * sleep.
+		 *
+		 * Note we cannot fail here, because if the work scheduled from
+		 * mem_cgroup_destroy_all_caches() were still pending, the
+		 * cache refcount wouldn't reach zero.
+		 */
+		if (!schedule_work(&s->memcg_params->destroy))
+			BUG();
+	}
 }
 
 static inline bool slab_equal_or_root(struct kmem_cache *s,
-- 
1.7.10.4


* [PATCH RESEND -mm 03/12] memcg: fix root vs memcg cache destruction race
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

When a root cache is being destroyed and we are about to initiate
destruction of its child caches (see kmem_cache_destroy_memcg_children),
we must somehow handle races with pending memcg cache destruction works.
Currently, we simply cancel the memcg_params::destroy work before
calling kmem_cache_destroy for a memcg cache, but that is wrong, because
the work handler may destroy and free the cache, resulting in a
use-after-free in cancel_work_sync. Furthermore, we do not handle the
case where memcg cache destruction is scheduled after we have started to
destroy the root cache - that is possible, because nothing prevents
memcgs from going offline while we are destroying a root cache. In that
case we are likely to get a use-after-free in the work handler.

This patch resolves the race as follows:

1) It makes kmem_cache_destroy silently exit if it is called for a memcg
   cache while the corresponding root cache is being destroyed, leaving
   the destruction to kmem_cache_destroy_memcg_children. That makes the
   call to cancel_work_sync safe when it is made from the root cache
   destruction path (sketched after this list).

2) It moves the cancel_work_sync call to after the child cache has been
   unregistered from its memcg. That prevents the work from being
   rescheduled from the memcg offline path.
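
A rough sketch of the bail-out from point 1, simplified and wrapped in
an illustrative helper; see the mm/slab_common.c hunk below for the real
control flow:

  static bool leave_to_root_destruction(struct kmem_cache *s)
  {
          struct kmem_cache *root = memcg_root_cache(s);

          if (root->refcount)
                  return false;   /* root cache alive, proceed as usual */

          /*
           * The root cache is already being destroyed: take an extra
           * reference so this kmem_cache_destroy() call becomes a no-op
           * and let kmem_cache_destroy_memcg_children() finish the job.
           */
          s->refcount++;
          return true;
  }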

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/memcontrol.h |    2 +-
 include/linux/slab.h       |    1 +
 mm/memcontrol.c            |   22 ++-----------
 mm/slab_common.c           |   77 ++++++++++++++++++++++++++++++++++++--------
 4 files changed, 69 insertions(+), 33 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bbe48913c56e..689442999562 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -512,7 +512,7 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
-int __kmem_cache_destroy_memcg_children(struct kmem_cache *s);
+int kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
  * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
diff --git a/include/linux/slab.h b/include/linux/slab.h
index 3ed53de256ea..ee9f1b0382ac 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -117,6 +117,7 @@ struct kmem_cache *kmem_cache_create(const char *, size_t, size_t,
 			void (*)(void *));
 #ifdef CONFIG_MEMCG_KMEM
 void kmem_cache_create_memcg(struct mem_cgroup *, struct kmem_cache *);
+void kmem_cache_destroy_memcg(struct kmem_cache *, bool);
 #endif
 void kmem_cache_destroy(struct kmem_cache *);
 int kmem_cache_shrink(struct kmem_cache *);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf32254ae1ee..21974ec406bb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3280,10 +3280,10 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 			return;
 	}
 
-	kmem_cache_destroy(cachep);
+	kmem_cache_destroy_memcg(cachep, false);
 }
 
-int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
+int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
 	struct kmem_cache *c;
 	int i, failed = 0;
@@ -3313,23 +3313,7 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 		if (!c)
 			continue;
 
-		/*
-		 * We will now manually delete the caches, so to avoid races
-		 * we need to cancel all pending destruction workers and
-		 * proceed with destruction ourselves.
-		 *
-		 * kmem_cache_destroy() will call kmem_cache_shrink internally,
-		 * and that could spawn the workers again: it is likely that
-		 * the cache still have active pages until this very moment.
-		 * This would lead us back to memcg_release_pages().
-		 *
-		 * But that will not execute at all if the refcount is > 0, so
-		 * increment it to guarantee we are in control.
-		 */
-		atomic_inc(&c->memcg_params->refcount);
-		cancel_work_sync(&c->memcg_params->destroy);
-		kmem_cache_destroy(c);
-
+		kmem_cache_destroy_memcg(c, true);
 		if (cache_from_memcg_idx(s, i))
 			failed++;
 	}
diff --git a/mm/slab_common.c b/mm/slab_common.c
index f3cfccf76dda..05ba3cd1b507 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -302,28 +302,70 @@ out_unlock:
 	put_online_cpus();
 }
 
-static int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
+static void __kmem_cache_destroy(struct kmem_cache *, bool);
+
+void kmem_cache_destroy_memcg(struct kmem_cache *s, bool destroying_root)
+{
+	BUG_ON(is_root_cache(s));
+	__kmem_cache_destroy(s, destroying_root);
+}
+
+static int __kmem_cache_shutdown_memcg(struct kmem_cache *s,
+				       bool destroying_root)
 {
-	int rc;
+	int rc = 0;
 
-	if (!s->memcg_params ||
-	    !s->memcg_params->is_root_cache)
+	if (!destroying_root) {
+		struct kmem_cache *root;
+
+		root = memcg_root_cache(s);
+		BUG_ON(root == s);
+		/*
+		 * If we are racing with the root cache destruction, let
+		 * kmem_cache_destroy_memcg_children() finish with this cache.
+		 */
+		if (!root->refcount) {
+			s->refcount++;
+			return 1;
+		}
+	}
+
+	if (!s->memcg_params)
 		return 0;
 
 	mutex_unlock(&slab_mutex);
-	rc = __kmem_cache_destroy_memcg_children(s);
+	if (s->memcg_params->is_root_cache) {
+		rc = kmem_cache_destroy_memcg_children(s);
+	} else {
+		/*
+		 * There might be a destruction work pending, which needs to be
+		 * cancelled before we start to destroy the cache.
+		 *
+		 * This should be done after we deleted all the references to
+		 * this cache, otherwise the work could be rescheduled.
+		 *
+		 * __kmem_cache_shutdown() will call kmem_cache_shrink()
+		 * internally, and that could spawn the worker again. We
+		 * increment the refcount to avoid that.
+		 */
+		if (destroying_root) {
+			cancel_work_sync(&s->memcg_params->destroy);
+			atomic_inc(&s->memcg_params->refcount);
+		}
+	}
 	mutex_lock(&slab_mutex);
 
 	return rc;
 }
 #else
-static int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
+static int __kmem_cache_shutdown_memcg(struct kmem_cache *s,
+				       bool destroying_root)
 {
 	return 0;
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
-void kmem_cache_destroy(struct kmem_cache *s)
+static void __kmem_cache_destroy(struct kmem_cache *s, bool destroying_root)
 {
 	get_online_cpus();
 	mutex_lock(&slab_mutex);
@@ -332,19 +374,17 @@ void kmem_cache_destroy(struct kmem_cache *s)
 	if (s->refcount)
 		goto out_unlock;
 
-	if (kmem_cache_destroy_memcg_children(s) != 0)
-		goto out_unlock;
-
 	list_del(&s->list);
 	memcg_unregister_cache(s);
 
+	if (__kmem_cache_shutdown_memcg(s, destroying_root) != 0)
+		goto out_undelete;
+
 	if (__kmem_cache_shutdown(s) != 0) {
-		list_add(&s->list, &slab_caches);
-		memcg_register_cache(s);
 		printk(KERN_ERR "kmem_cache_destroy %s: "
 		       "Slab cache still has objects\n", s->name);
 		dump_stack();
-		goto out_unlock;
+		goto out_undelete;
 	}
 
 	mutex_unlock(&slab_mutex);
@@ -360,6 +400,17 @@ out_unlock:
 	mutex_unlock(&slab_mutex);
 out_put_cpus:
 	put_online_cpus();
+	return;
+out_undelete:
+	list_add(&s->list, &slab_caches);
+	memcg_register_cache(s);
+	goto out_unlock;
+}
+
+void kmem_cache_destroy(struct kmem_cache *s)
+{
+	BUG_ON(!is_root_cache(s));
+	__kmem_cache_destroy(s, true);
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
-- 
1.7.10.4


* [PATCH RESEND -mm 04/12] memcg: move slab caches list/mutex init to memcg creation
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

I need these to be initialized even for cgroups that have not had kmem
accounting activated.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 mm/memcontrol.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 21974ec406bb..3659d90d5a40 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5043,8 +5043,6 @@ static int __memcg_activate_kmem(struct mem_cgroup *memcg,
 		goto out_rmid;
 
 	memcg->kmemcg_id = memcg_id;
-	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
-	mutex_init(&memcg->slab_caches_mutex);
 
 	/*
 	 * We couldn't have accounted to this cgroup, because it hasn't got the
@@ -5791,6 +5789,9 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 	int ret;
 
 	memcg->kmemcg_id = -1;
+	INIT_LIST_HEAD(&memcg->memcg_slab_caches);
+	mutex_init(&memcg->slab_caches_mutex);
+
 	ret = memcg_propagate_kmem(memcg);
 	if (ret)
 		return ret;
-- 
1.7.10.4


* [PATCH RESEND -mm 05/12] memcg: add pointer from memcg_cache_params to cache
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

While iterating over the memcg_slab_caches list we need to obtain the
cache that a particular memcg_params corresponds to, because it is the
memcg_params structures that are linked into the list, not the caches
themselves. Currently, to achieve that, we rely on the fact that each
root cache tracks all its child caches in its memcg_caches array, so
that given a memcg_params we can find the memcg cache by doing something
like this:

  root = memcg_params->root_cache;
  memcg = memcg_params->memcg;
  memcg_cache = root->memcg_params->memcg_caches[memcg->kmemcg_id];

Apart from being cumbersome, this is not going to work when reparenting
of memcg caches is introduced. So let's embed a pointer back to the
memcg cache into the memcg_cache_params struct.
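
With the back pointer in place, the lookup above reduces to just (a
sketch of the result of this patch):

  memcg_cache = memcg_params->cachep;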

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/slab.h |    2 ++
 mm/memcontrol.c      |   28 +++-------------------------
 2 files changed, 5 insertions(+), 25 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index ee9f1b0382ac..f2fd4212976e 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -520,6 +520,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  *
  * Child caches will hold extra metadata needed for its operation. Fields are:
  *
+ * @cachep: back pointer to the memcg cache
  * @memcg: pointer to the memcg this cache belongs to
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
@@ -536,6 +537,7 @@ struct memcg_cache_params {
 			struct kmem_cache *memcg_caches[0];
 		};
 		struct {
+			struct kmem_cache *cachep;
 			struct mem_cgroup *memcg;
 			struct list_head list;
 			struct kmem_cache *root_cache;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3659d90d5a40..626a37e01126 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2912,19 +2912,6 @@ static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
 		memcg_kmem_is_active(memcg);
 }
 
-/*
- * This is a bit cumbersome, but it is rarely used and avoids a backpointer
- * in the memcg_cache_params struct.
- */
-static struct kmem_cache *memcg_params_to_cache(struct memcg_cache_params *p)
-{
-	struct kmem_cache *cachep;
-
-	VM_BUG_ON(p->is_root_cache);
-	cachep = p->root_cache;
-	return cache_from_memcg_idx(cachep, memcg_cache_id(p->memcg));
-}
-
 #ifdef CONFIG_SLABINFO
 static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
 {
@@ -2938,7 +2925,7 @@ static int mem_cgroup_slabinfo_read(struct seq_file *m, void *v)
 
 	mutex_lock(&memcg->slab_caches_mutex);
 	list_for_each_entry(params, &memcg->memcg_slab_caches, list)
-		cache_show(memcg_params_to_cache(params), m);
+		cache_show(params->cachep, m);
 	mutex_unlock(&memcg->slab_caches_mutex);
 
 	return 0;
@@ -3137,6 +3124,7 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 		return -ENOMEM;
 
 	if (memcg) {
+		s->memcg_params->cachep = s;
 		s->memcg_params->memcg = memcg;
 		s->memcg_params->root_cache = root_cache;
 		INIT_WORK(&s->memcg_params->destroy,
@@ -3184,11 +3172,6 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	smp_wmb();
 
-	/*
-	 * Initialize the pointer to this cache in its parent's memcg_params
-	 * before adding it to the memcg_slab_caches list, otherwise we can
-	 * fail to convert memcg_params_to_cache() while traversing the list.
-	 */
 	VM_BUG_ON(root->memcg_params->memcg_caches[id]);
 	root->memcg_params->memcg_caches[id] = s;
 
@@ -3220,11 +3203,6 @@ void memcg_unregister_cache(struct kmem_cache *s)
 	list_del(&s->memcg_params->list);
 	mutex_unlock(&memcg->slab_caches_mutex);
 
-	/*
-	 * Clear the pointer to this cache in its parent's memcg_params only
-	 * after removing it from the memcg_slab_caches list, otherwise we can
-	 * fail to convert memcg_params_to_cache() while traversing the list.
-	 */
 	VM_BUG_ON(root->memcg_params->memcg_caches[id] != s);
 	root->memcg_params->memcg_caches[id] = NULL;
 }
@@ -3266,7 +3244,7 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 	struct memcg_cache_params *params;
 
 	params = container_of(w, struct memcg_cache_params, destroy);
-	cachep = memcg_params_to_cache(params);
+	cachep = params->cachep;
 
 	if (atomic_read(&params->refcount) != 0) {
 		/*
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH RESEND -mm 06/12] memcg: keep all children of each root cache on a list
  2014-03-13 15:06 ` Vladimir Davydov
@ 2014-03-13 15:06   ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

Sometimes we need to iterate over all child caches of a particular root
cache, e.g. when we are destroying it. Currently each root cache keeps
pointers to its children in its memcg_cache_params::memcg_caches array,
so that we can enumerate all active kmemcg ids and dereference the
corresponding array slots to get the memcg caches. However, this is
going to change when memcg
cache reparenting is introduced - only active (not dead) caches will
reside in this array. So let's organize all child caches of the same
root cache into a list on memcg_cache_params.
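
For illustration, with the list in place walking a root cache's children
becomes a plain list traversal (condensed from the hunks below;
process_child() is just a placeholder for the per-child operation):

  #ifdef CONFIG_MEMCG_KMEM
  	struct memcg_cache_params *params;

  	/* s is a root cache; its children are linked via params->siblings */
  	list_for_each_entry(params, &s->memcg_params->children, siblings)
  		process_child(params->cachep);
  #endif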

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/memcontrol.h |    2 +-
 include/linux/slab.h       |    3 +++
 mm/memcontrol.c            |   36 +++++++++++++++++++-----------------
 mm/slab.c                  |   38 ++++++++++++++++++++++----------------
 mm/slab_common.c           |   19 +++++++++----------
 mm/slub.c                  |   41 +++++++++++++++++++++++++----------------
 6 files changed, 79 insertions(+), 60 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 689442999562..925dd7e8bbb1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -512,7 +512,7 @@ void memcg_update_array_size(int num_groups);
 struct kmem_cache *
 __memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 
-int kmem_cache_destroy_memcg_children(struct kmem_cache *s);
+void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
  * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
diff --git a/include/linux/slab.h b/include/linux/slab.h
index f2fd4212976e..8091d009cd72 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -524,6 +524,7 @@ static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
  * @memcg: pointer to the memcg this cache belongs to
  * @list: list_head for the list of all caches in this memcg
  * @root_cache: pointer to the global, root cache, this cache was derived from
+ * @siblings: list_head for the list of all child caches of the root_cache
  * @refcount: the reference counter; cache destruction will be scheduled when
  *            it reaches zero
  * @destroy: worker to be called whenever we are ready, or believe we may be
@@ -533,6 +534,7 @@ struct memcg_cache_params {
 	bool is_root_cache;
 	union {
 		struct {
+			struct list_head children;
 			struct rcu_head rcu_head;
 			struct kmem_cache *memcg_caches[0];
 		};
@@ -541,6 +543,7 @@ struct memcg_cache_params {
 			struct mem_cgroup *memcg;
 			struct list_head list;
 			struct kmem_cache *root_cache;
+			struct list_head siblings;
 			atomic_t refcount;
 			struct work_struct destroy;
 		};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 626a37e01126..e03e9a3535bb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3049,6 +3049,10 @@ int memcg_update_cache_size(struct kmem_cache *s, int num_groups)
 			return -ENOMEM;
 
 		new_params->is_root_cache = true;
+		INIT_LIST_HEAD(&new_params->children);
+		if (cur_params)
+			list_splice(&cur_params->children,
+				    &new_params->children);
 
 		/*
 		 * There is the chance it will be bigger than
@@ -3131,8 +3135,10 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 				kmem_cache_destroy_work_func);
 		atomic_set(&s->memcg_params->refcount, 1);
 		css_get(&memcg->css);
-	} else
+	} else {
 		s->memcg_params->is_root_cache = true;
+		INIT_LIST_HEAD(&s->memcg_params->children);
+	}
 
 	return 0;
 }
@@ -3172,6 +3178,8 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	smp_wmb();
 
+	list_add(&s->memcg_params->siblings, &root->memcg_params->children);
+
 	VM_BUG_ON(root->memcg_params->memcg_caches[id]);
 	root->memcg_params->memcg_caches[id] = s;
 
@@ -3199,6 +3207,8 @@ void memcg_unregister_cache(struct kmem_cache *s)
 	memcg = s->memcg_params->memcg;
 	id = memcg_cache_id(memcg);
 
+	list_del(&s->memcg_params->siblings);
+
 	mutex_lock(&memcg->slab_caches_mutex);
 	list_del(&s->memcg_params->list);
 	mutex_unlock(&memcg->slab_caches_mutex);
@@ -3261,10 +3271,9 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 	kmem_cache_destroy_memcg(cachep, false);
 }
 
-int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
+void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 {
-	struct kmem_cache *c;
-	int i, failed = 0;
+	struct memcg_cache_params *params, *tmp;
 
 	/*
 	 * Since the cache is being destroyed, it shouldn't be allocated from
@@ -3276,9 +3285,9 @@ int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	flush_workqueue(memcg_cache_create_wq);
 
 	/*
-	 * If the cache is being destroyed, we trust that there is no one else
-	 * requesting objects from it. Even if there are, the sanity checks in
-	 * kmem_cache_destroy should caught this ill-case.
+	 * At this point nobody except us is allowed to create or destroy child
+	 * caches so we don't need to take the slab_mutex for iterating over
+	 * the children list.
 	 *
 	 * Still, we don't want anyone else freeing memcg_caches under our
 	 * noses, which can happen if a new memcg comes to life. As usual,
@@ -3286,17 +3295,10 @@ int kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	 * this.
 	 */
 	mutex_lock(&activate_kmem_mutex);
-	for_each_memcg_cache_index(i) {
-		c = cache_from_memcg_idx(s, i);
-		if (!c)
-			continue;
-
-		kmem_cache_destroy_memcg(c, true);
-		if (cache_from_memcg_idx(s, i))
-			failed++;
-	}
+	list_for_each_entry_safe(params, tmp,
+			&s->memcg_params->children, siblings)
+		kmem_cache_destroy_memcg(params->cachep, true);
 	mutex_unlock(&activate_kmem_mutex);
-	return failed;
 }
 
 static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
diff --git a/mm/slab.c b/mm/slab.c
index eebc619ae33c..040dcd89bd6d 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3816,29 +3816,35 @@ static int __do_tune_cpucache(struct kmem_cache *cachep, int limit,
 	return alloc_kmemlist(cachep, gfp);
 }
 
+static void __do_tune_cpucache_memcg(struct kmem_cache *cachep, int limit,
+				     int batchcount, int shared, gfp_t gfp)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	struct memcg_cache_params *params;
+
+	if (!cachep->memcg_params ||
+	    !cachep->memcg_params->is_root_cache)
+		return;
+
+	lockdep_assert_held(&slab_mutex);
+	list_for_each_entry(params,
+			&cachep->memcg_params->children, siblings)
+		__do_tune_cpucache(params->cachep, limit,
+				   batchcount, shared, gfp);
+#endif
+}
+
 static int do_tune_cpucache(struct kmem_cache *cachep, int limit,
 				int batchcount, int shared, gfp_t gfp)
 {
 	int ret;
-	struct kmem_cache *c = NULL;
-	int i = 0;
 
 	ret = __do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
-
-	if (slab_state < FULL)
-		return ret;
-
-	if ((ret < 0) || !is_root_cache(cachep))
-		return ret;
-
-	VM_BUG_ON(!mutex_is_locked(&slab_mutex));
-	for_each_memcg_cache_index(i) {
-		c = cache_from_memcg_idx(cachep, i);
-		if (c)
-			/* return value determined by the parent cache only */
-			__do_tune_cpucache(c, limit, batchcount, shared, gfp);
+	if (!ret) {
+		/* return value determined by the parent cache only */
+		__do_tune_cpucache_memcg(cachep, limit,
+					 batchcount, shared, gfp);
 	}
-
 	return ret;
 }
 
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 05ba3cd1b507..48e472894511 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -335,7 +335,8 @@ static int __kmem_cache_shutdown_memcg(struct kmem_cache *s,
 
 	mutex_unlock(&slab_mutex);
 	if (s->memcg_params->is_root_cache) {
-		rc = kmem_cache_destroy_memcg_children(s);
+		kmem_cache_destroy_memcg_children(s);
+		rc = !list_empty(&s->memcg_params->children);
 	} else {
 		/*
 		 * There might be a destruction work pending, which needs to be
@@ -693,20 +694,17 @@ void slab_stop(struct seq_file *m, void *p)
 static void
 memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 {
-	struct kmem_cache *c;
+#ifdef CONFIG_MEMCG_KMEM
+	struct memcg_cache_params *params;
 	struct slabinfo sinfo;
-	int i;
 
-	if (!is_root_cache(s))
+	if (!s->memcg_params ||
+	    !s->memcg_params->is_root_cache)
 		return;
 
-	for_each_memcg_cache_index(i) {
-		c = cache_from_memcg_idx(s, i);
-		if (!c)
-			continue;
-
+	list_for_each_entry(params, &s->memcg_params->children, siblings) {
 		memset(&sinfo, 0, sizeof(sinfo));
-		get_slabinfo(c, &sinfo);
+		get_slabinfo(params->cachep, &sinfo);
 
 		info->active_slabs += sinfo.active_slabs;
 		info->num_slabs += sinfo.num_slabs;
@@ -714,6 +712,7 @@ memcg_accumulate_slabinfo(struct kmem_cache *s, struct slabinfo *info)
 		info->active_objs += sinfo.active_objs;
 		info->num_objs += sinfo.num_objs;
 	}
+#endif
 }
 
 int cache_show(struct kmem_cache *s, struct seq_file *m)
diff --git a/mm/slub.c b/mm/slub.c
index 5c6b2b26ec50..66e8e7bef27f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3741,6 +3741,25 @@ static struct kmem_cache *find_mergeable(size_t size, size_t align,
 	return NULL;
 }
 
+static void memcg_slab_merge(struct kmem_cache *s, size_t size)
+{
+#ifdef CONFIG_MEMCG_KMEM
+	struct kmem_cache *cachep;
+	struct memcg_cache_params *params;
+
+	if (!s->memcg_params)
+		return;
+	BUG_ON(!s->memcg_params->is_root_cache);
+
+	list_for_each_entry(params, &s->memcg_params->children, siblings) {
+		cachep = params->cachep;
+		cachep->object_size = s->object_size;
+		cachep->inuse = max_t(int, cachep->inuse,
+				      ALIGN(size, sizeof(void *)));
+	}
+#endif
+}
+
 struct kmem_cache *
 __kmem_cache_alias(const char *name, size_t size, size_t align,
 		   unsigned long flags, void (*ctor)(void *))
@@ -3749,9 +3768,6 @@ __kmem_cache_alias(const char *name, size_t size, size_t align,
 
 	s = find_mergeable(size, align, flags, name, ctor);
 	if (s) {
-		int i;
-		struct kmem_cache *c;
-
 		s->refcount++;
 
 		/*
@@ -3761,14 +3777,7 @@ __kmem_cache_alias(const char *name, size_t size, size_t align,
 		s->object_size = max(s->object_size, (int)size);
 		s->inuse = max_t(int, s->inuse, ALIGN(size, sizeof(void *)));
 
-		for_each_memcg_cache_index(i) {
-			c = cache_from_memcg_idx(s, i);
-			if (!c)
-				continue;
-			c->object_size = s->object_size;
-			c->inuse = max_t(int, c->inuse,
-					 ALIGN(size, sizeof(void *)));
-		}
+		memcg_slab_merge(s, size);
 
 		if (sysfs_slab_alias(s, name)) {
 			s->refcount--;
@@ -5028,7 +5037,7 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 	err = attribute->store(s, buf, len);
 #ifdef CONFIG_MEMCG_KMEM
 	if (slab_state >= FULL && err >= 0 && is_root_cache(s)) {
-		int i;
+		struct memcg_cache_params *params;
 
 		mutex_lock(&slab_mutex);
 		if (s->max_attr_size < len)
@@ -5051,10 +5060,10 @@ static ssize_t slab_attr_store(struct kobject *kobj,
 		 * directly either failed or succeeded, in which case we loop
 		 * through the descendants with best-effort propagation.
 		 */
-		for_each_memcg_cache_index(i) {
-			struct kmem_cache *c = cache_from_memcg_idx(s, i);
-			if (c)
-				attribute->store(c, buf, len);
+		if (s->memcg_params) {
+			list_for_each_entry(params,
+					&s->memcg_params->children, siblings)
+				attribute->store(params->cachep, buf, len);
 		}
 		mutex_unlock(&slab_mutex);
 	}
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH RESEND -mm 07/12] memcg: rework slab charging
  2014-03-13 15:06 ` Vladimir Davydov
@ 2014-03-13 15:06   ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm
  Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel,
	Christoph Lameter, Pekka Enberg

Currently kmemcg charging is embedded in the alloc_pages path - if we
get there with the GFP_KMEMCG bit set, we charge the new page to the
cgroup of the caller. All per-memcg caches have this bit set in
allocflags so kmalloc and friends are properly charged.

So, what's wrong with it, why should it be reworked?

First, we do some extra work due to this design. We get the memcg from
the mm, but we already know the cache we are allocating a page for, so
why not simply get the memcg from there? We remember the memcg a page is
charged to in
a page_cgroup in order to properly uncharge it, but again each kmem slab
holds a reference to its kmem cache in page->slab_cache so we could use
that instead.

Second, it's racy. If a task changes its cgroup between selecting a
cache to allocate from (memcg_kmem_get_cache) and charging, an object
allocated from one cgroup's cache will be accounted to another cgroup.

And last but not least, we don't have a reliable way to track all
kmem pages accounted to a particular memcg, which makes reparenting
impossible. As a result, each memcg cache holds a reference to its memcg
until death, which is bad.

Since we have only a couple of places where we should actually charge
kmem pages, why not just insert kmemcg charge/uncharge calls there,
passing the slab cache we are allocating from, instead of hooking into
the generic allocation path? That's what this patch does.

Note, this patch does not remove the old code - that is handled by the
following patches.
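
For illustration, a condensed view of the kmem_getpages() hunk below:
the slab page allocation is now bracketed by an explicit charge and, on
failure, uncharge against the cache's memcg:

  nr_pages = 1 << cachep->gfporder;
  if (memcg_kmem_charge_slab(cachep, flags, nr_pages))
  	return NULL;

  page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK,
  				cachep->gfporder);
  if (!page) {
  	memcg_kmem_uncharge_slab(cachep, nr_pages);
  	return NULL;
  }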

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
---
 include/linux/memcontrol.h |   44 +++++++++++++++++++++++++++++---------------
 mm/memcontrol.c            |   15 +++++++++++++++
 mm/slab.c                  |    9 +++++++--
 mm/slab_common.c           |    6 +-----
 mm/slub.c                  |   27 ++++++++++++++++++++-------
 5 files changed, 72 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 925dd7e8bbb1..b11808e7e6ee 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -496,6 +496,10 @@ void __memcg_kmem_commit_charge(struct page *page,
 				       struct mem_cgroup *memcg, int order);
 void __memcg_kmem_uncharge_pages(struct page *page, int order);
 
+struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
+int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages);
+void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages);
+
 int memcg_cache_id(struct mem_cgroup *memcg);
 
 char *memcg_create_cache_name(struct mem_cgroup *memcg,
@@ -509,9 +513,6 @@ void memcg_unregister_cache(struct kmem_cache *s);
 int memcg_update_cache_size(struct kmem_cache *s, int num_groups);
 void memcg_update_array_size(int num_groups);
 
-struct kmem_cache *
-__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
-
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
@@ -587,18 +588,6 @@ memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
  * memcg_kmem_get_cache: selects the correct per-memcg cache for allocation
  * @cachep: the original global kmem cache
  * @gfp: allocation flags.
- *
- * This function assumes that the task allocating, which determines the memcg
- * in the page allocator, belongs to the same cgroup throughout the whole
- * process.  Misacounting can happen if the task calls memcg_kmem_get_cache()
- * while belonging to a cgroup, and later on changes. This is considered
- * acceptable, and should only happen upon task migration.
- *
- * Before the cache is created by the memcg core, there is also a possible
- * imbalance: the task belongs to a memcg, but the cache being allocated from
- * is the global cache, since the child cache is not yet guaranteed to be
- * ready. This case is also fine, since in this case the GFP_KMEMCG will not be
- * passed and the page allocator will not attempt any cgroup accounting.
  */
 static __always_inline struct kmem_cache *
 memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
@@ -614,6 +603,21 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 
 	return __memcg_kmem_get_cache(cachep, gfp);
 }
+
+static __always_inline int memcg_kmem_charge_slab(struct kmem_cache *s,
+						  gfp_t gfp, int nr_pages)
+{
+	if (memcg_kmem_enabled())
+		return __memcg_kmem_charge_slab(s, gfp, nr_pages);
+	return 0;
+}
+
+static __always_inline void memcg_kmem_uncharge_slab(struct kmem_cache *s,
+						     int nr_pages)
+{
+	if (memcg_kmem_enabled())
+		__memcg_kmem_uncharge_slab(s, nr_pages);
+}
 #else
 #define for_each_memcg_cache_index(_idx)	\
 	for (; NULL; )
@@ -666,6 +670,16 @@ memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp)
 {
 	return cachep;
 }
+
+static inline int memcg_kmem_charge_slab(struct kmem_cache *s,
+					 gfp_t gfp, int nr_pages)
+{
+	return 0;
+}
+
+static inline void memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages)
+{
+}
 #endif /* CONFIG_MEMCG_KMEM */
 #endif /* _LINUX_MEMCONTROL_H */
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e03e9a3535bb..9b6f45607f4f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2992,6 +2992,21 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 		css_put(&memcg->css);
 }
 
+int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages)
+{
+	if (is_root_cache(s))
+		return 0;
+	return memcg_charge_kmem(s->memcg_params->memcg,
+				 gfp, nr_pages << PAGE_SHIFT);
+}
+
+void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages)
+{
+	if (is_root_cache(s))
+		return;
+	memcg_uncharge_kmem(s->memcg_params->memcg, nr_pages << PAGE_SHIFT);
+}
+
 /*
  * helper for acessing a memcg's index. It will be used as an index in the
  * child cache array in kmem_cache, and also to derive its name. This function
diff --git a/mm/slab.c b/mm/slab.c
index 040dcd89bd6d..1656144ca8a0 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1664,10 +1664,15 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		flags |= __GFP_RECLAIMABLE;
 
+	nr_pages = (1 << cachep->gfporder);
+	if (memcg_kmem_charge_slab(cachep, flags, nr_pages))
+		return NULL;
+
 	page = alloc_pages_exact_node(nodeid, flags | __GFP_NOTRACK, cachep->gfporder);
 	if (!page) {
 		if (!(flags & __GFP_NOWARN) && printk_ratelimit())
 			slab_out_of_memory(cachep, flags, nodeid);
+		memcg_kmem_uncharge_slab(cachep, nr_pages);
 		return NULL;
 	}
 
@@ -1675,7 +1680,6 @@ static struct page *kmem_getpages(struct kmem_cache *cachep, gfp_t flags,
 	if (unlikely(page->pfmemalloc))
 		pfmemalloc_active = true;
 
-	nr_pages = (1 << cachep->gfporder);
 	if (cachep->flags & SLAB_RECLAIM_ACCOUNT)
 		add_zone_page_state(page_zone(page),
 			NR_SLAB_RECLAIMABLE, nr_pages);
@@ -1724,7 +1728,8 @@ static void kmem_freepages(struct kmem_cache *cachep, struct page *page)
 	memcg_release_pages(cachep, cachep->gfporder);
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += nr_freed;
-	__free_memcg_kmem_pages(page, cachep->gfporder);
+	memcg_kmem_uncharge_slab(cachep, nr_freed);
+	__free_pages(page, cachep->gfporder);
 }
 
 static void kmem_rcu_free(struct rcu_head *head)
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 48e472894511..22e48d000b1d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -290,12 +290,8 @@ void kmem_cache_create_memcg(struct mem_cgroup *memcg, struct kmem_cache *root_c
 				 root_cache->size, root_cache->align,
 				 root_cache->flags, root_cache->ctor,
 				 memcg, root_cache);
-	if (IS_ERR(s)) {
+	if (IS_ERR(s))
 		kfree(cache_name);
-		goto out_unlock;
-	}
-
-	s->allocflags |= __GFP_KMEMCG;
 
 out_unlock:
 	mutex_unlock(&slab_mutex);
diff --git a/mm/slub.c b/mm/slub.c
index 66e8e7bef27f..bd75ec0e2ef8 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1328,8 +1328,9 @@ static inline struct page *alloc_slab_page(gfp_t flags, int node,
 
 static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 {
-	struct page *page;
+	struct page *page = NULL;
 	struct kmem_cache_order_objects oo = s->oo;
+	int pages = 1 << oo_order(oo);
 	gfp_t alloc_gfp;
 
 	flags &= gfp_allowed_mask;
@@ -1345,22 +1346,33 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 	 */
 	alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL;
 
+	if (memcg_kmem_charge_slab(s, alloc_gfp, pages))
+		goto out;
+
 	page = alloc_slab_page(alloc_gfp, node, oo);
 	if (unlikely(!page)) {
+		int charged = pages;
+
 		oo = s->min;
+		pages = 1 << oo_order(oo);
 		alloc_gfp = flags;
 		/*
 		 * Allocation may have failed due to fragmentation.
 		 * Try a lower order alloc if possible
 		 */
 		page = alloc_slab_page(alloc_gfp, node, oo);
+		if (!page) {
+			memcg_kmem_uncharge_slab(s, charged);
+			goto out;
+		}
 
-		if (page)
-			stat(s, ORDER_FALLBACK);
+		VM_BUG_ON(charged <= pages);
+		memcg_kmem_uncharge_slab(s, charged - pages);
+		stat(s, ORDER_FALLBACK);
 	}
 
-	if (kmemcheck_enabled && page
-		&& !(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
+	if (kmemcheck_enabled &&
+	    !(s->flags & (SLAB_NOTRACK | DEBUG_DEFAULT_FLAGS))) {
 		int pages = 1 << oo_order(oo);
 
 		kmemcheck_alloc_shadow(page, oo_order(oo), alloc_gfp, node);
@@ -1374,7 +1386,7 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
 		else
 			kmemcheck_mark_unallocated_pages(page, pages);
 	}
-
+out:
 	if (flags & __GFP_WAIT)
 		local_irq_disable();
 	if (!page)
@@ -1469,7 +1481,8 @@ static void __free_slab(struct kmem_cache *s, struct page *page)
 	page_mapcount_reset(page);
 	if (current->reclaim_state)
 		current->reclaim_state->reclaimed_slab += pages;
-	__free_memcg_kmem_pages(page, order);
+	memcg_kmem_uncharge_slab(s, pages);
+	__free_pages(page, order);
 }
 
 #define need_reserve_slab_rcu						\
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH RESEND -mm 08/12] memcg: do not charge kmalloc_large allocations
  2014-03-13 15:06 ` Vladimir Davydov
@ 2014-03-13 15:06   ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm
  Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel,
	Christoph Lameter, Pekka Enberg

We don't have a way to track kmalloc_large allocations, so charging
them would make kmemcg reparenting impossible. Since such allocations are
rare and can't be massively triggered from userspace, let's just ignore
them.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
---
 include/linux/slab.h |    2 +-
 mm/slub.c            |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/slab.h b/include/linux/slab.h
index 8091d009cd72..29dbf6f2fd3a 100644
--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -364,7 +364,7 @@ kmalloc_order(size_t size, gfp_t flags, unsigned int order)
 {
 	void *ret;
 
-	flags |= (__GFP_COMP | __GFP_KMEMCG);
+	flags |= __GFP_COMP;
 	ret = (void *) __get_free_pages(flags, order);
 	kmemleak_alloc(ret, size, 1, flags);
 	return ret;
diff --git a/mm/slub.c b/mm/slub.c
index bd75ec0e2ef8..3ea91fb54f41 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3334,7 +3334,7 @@ static void *kmalloc_large_node(size_t size, gfp_t flags, int node)
 	struct page *page;
 	void *ptr = NULL;
 
-	flags |= __GFP_COMP | __GFP_NOTRACK | __GFP_KMEMCG;
+	flags |= __GFP_COMP | __GFP_NOTRACK;
 	page = alloc_pages_node(node, flags, get_order(size));
 	if (page)
 		ptr = page_address(page);
@@ -3404,7 +3404,7 @@ void kfree(const void *x)
 	if (unlikely(!PageSlab(page))) {
 		BUG_ON(!PageCompound(page));
 		kfree_hook(x);
-		__free_memcg_kmem_pages(page, compound_order(page));
+		__free_pages(page, compound_order(page));
 		return;
 	}
 	slab_free(page->slab_cache, page, object, _RET_IP_);
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH RESEND -mm 09/12] fork: do not charge thread_info to kmemcg
  2014-03-13 15:06 ` Vladimir Davydov
@ 2014-03-13 15:06   ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm
  Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel,
	Frederic Weisbecker

This patch reverts 2ad306b17c0a ("fork: protect architectures where
THREAD_SIZE >= PAGE_SIZE against fork bombs").

The reasoning behind this is that charging thread_info is the last piece
that prevents us from reparenting kmemcg on css offline. The point is
that we can't reliably track all thread_info pages accounted to a
particular cgroup, because (a) thread_info is freed in __put_task_struct()
and (b) on exit tasks are moved to the root cgroup. That is, given a
cgroup, there is no sane way to find all tasks (including zombies) that
charged thread_info to it. Of course, we could uncharge thread_info on
task exit, but that wouldn't help us against fork bombs. So revert and
forget about this.
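
To make the problem concrete, here is a purely hypothetical sketch (not
part of this patch) of what "uncharge at free time" would have to look
like. The memcg lookup is the broken step: by the time __put_task_struct()
runs, the exiting task has already been migrated to the root cgroup, so
the charge taken at fork time can no longer be found. Neither helper is
actually usable from kernel/fork.c as-is, which is part of the point:

	/*
	 * Hypothetical sketch only - this is what the patch deliberately
	 * does NOT try to do.  mem_cgroup_from_task() would return the
	 * root memcg here, not the memcg that was charged at fork time,
	 * so the uncharge would hit the wrong counter.
	 */
	static void free_thread_info_uncharge(struct task_struct *tsk,
					      struct thread_info *ti)
	{
		struct mem_cgroup *memcg = mem_cgroup_from_task(tsk);

		memcg_uncharge_kmem(memcg, PAGE_SIZE << THREAD_SIZE_ORDER);
		free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
	}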

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Frederic Weisbecker <fweisbec@redhat.com>
---
 include/linux/thread_info.h |    2 --
 kernel/fork.c               |    4 ++--
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/include/linux/thread_info.h b/include/linux/thread_info.h
index fddbe2023a5d..1807bb194816 100644
--- a/include/linux/thread_info.h
+++ b/include/linux/thread_info.h
@@ -61,8 +61,6 @@ extern long do_no_restart_syscall(struct restart_block *parm);
 # define THREADINFO_GFP		(GFP_KERNEL | __GFP_NOTRACK)
 #endif
 
-#define THREADINFO_GFP_ACCOUNTED (THREADINFO_GFP | __GFP_KMEMCG)
-
 /*
  * flag set/clear/test wrappers
  * - pass TIF_xxxx constants to these functions
diff --git a/kernel/fork.c b/kernel/fork.c
index ea1bb6a54823..c22bdaa5db4e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -150,7 +150,7 @@ void __weak arch_release_thread_info(struct thread_info *ti)
 static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 						  int node)
 {
-	struct page *page = alloc_pages_node(node, THREADINFO_GFP_ACCOUNTED,
+	struct page *page = alloc_pages_node(node, THREADINFO_GFP,
 					     THREAD_SIZE_ORDER);
 
 	return page ? page_address(page) : NULL;
@@ -158,7 +158,7 @@ static struct thread_info *alloc_thread_info_node(struct task_struct *tsk,
 
 static inline void free_thread_info(struct thread_info *ti)
 {
-	free_memcg_kmem_pages((unsigned long)ti, THREAD_SIZE_ORDER);
+	free_pages((unsigned long)ti, THREAD_SIZE_ORDER);
 }
 # else
 static struct kmem_cache *thread_info_cache;
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH RESEND -mm 10/12] memcg: kill GFP_KMEMCG and stuff
  2014-03-13 15:06 ` Vladimir Davydov
@ 2014-03-13 15:06   ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

No one uses it any more. Just get rid of it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 include/linux/gfp.h             |    5 --
 include/linux/memcontrol.h      |   90 ------------------------------
 include/trace/events/gfpflags.h |    1 -
 mm/memcontrol.c                 |  117 ---------------------------------------
 mm/page_alloc.c                 |   35 ------------
 5 files changed, 248 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 39b81dc7d01a..e37b662cd869 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -31,7 +31,6 @@ struct vm_area_struct;
 #define ___GFP_HARDWALL		0x20000u
 #define ___GFP_THISNODE		0x40000u
 #define ___GFP_RECLAIMABLE	0x80000u
-#define ___GFP_KMEMCG		0x100000u
 #define ___GFP_NOTRACK		0x200000u
 #define ___GFP_NO_KSWAPD	0x400000u
 #define ___GFP_OTHER_NODE	0x800000u
@@ -91,7 +90,6 @@ struct vm_area_struct;
 
 #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
-#define __GFP_KMEMCG	((__force gfp_t)___GFP_KMEMCG) /* Allocation comes from a memcg-accounted resource */
 #define __GFP_WRITE	((__force gfp_t)___GFP_WRITE)	/* Allocator intends to dirty page */
 
 /*
@@ -372,9 +370,6 @@ extern void free_pages(unsigned long addr, unsigned int order);
 extern void free_hot_cold_page(struct page *page, int cold);
 extern void free_hot_cold_page_list(struct list_head *list, int cold);
 
-extern void __free_memcg_kmem_pages(struct page *page, unsigned int order);
-extern void free_memcg_kmem_pages(unsigned long addr, unsigned int order);
-
 #define __free_page(page) __free_pages((page), 0)
 #define free_page(addr) free_pages((addr), 0)
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b11808e7e6ee..cd0f8d5095d7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -490,12 +490,6 @@ static inline bool memcg_kmem_enabled(void)
  * conditions, but because they are pretty simple, they are expected to be
  * fast.
  */
-bool __memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg,
-					int order);
-void __memcg_kmem_commit_charge(struct page *page,
-				       struct mem_cgroup *memcg, int order);
-void __memcg_kmem_uncharge_pages(struct page *page, int order);
-
 struct kmem_cache *__memcg_kmem_get_cache(struct kmem_cache *cachep, gfp_t gfp);
 int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages);
 void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages);
@@ -516,75 +510,6 @@ void memcg_update_array_size(int num_groups);
 void kmem_cache_destroy_memcg_children(struct kmem_cache *s);
 
 /**
- * memcg_kmem_newpage_charge: verify if a new kmem allocation is allowed.
- * @gfp: the gfp allocation flags.
- * @memcg: a pointer to the memcg this was charged against.
- * @order: allocation order.
- *
- * returns true if the memcg where the current task belongs can hold this
- * allocation.
- *
- * We return true automatically if this allocation is not to be accounted to
- * any memcg.
- */
-static inline bool
-memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
-{
-	if (!memcg_kmem_enabled())
-		return true;
-
-	/*
-	 * __GFP_NOFAIL allocations will move on even if charging is not
-	 * possible. Therefore we don't even try, and have this allocation
-	 * unaccounted. We could in theory charge it with
-	 * res_counter_charge_nofail, but we hope those allocations are rare,
-	 * and won't be worth the trouble.
-	 */
-	if (!(gfp & __GFP_KMEMCG) || (gfp & __GFP_NOFAIL))
-		return true;
-	if (in_interrupt() || (!current->mm) || (current->flags & PF_KTHREAD))
-		return true;
-
-	/* If the test is dying, just let it go. */
-	if (unlikely(fatal_signal_pending(current)))
-		return true;
-
-	return __memcg_kmem_newpage_charge(gfp, memcg, order);
-}
-
-/**
- * memcg_kmem_uncharge_pages: uncharge pages from memcg
- * @page: pointer to struct page being freed
- * @order: allocation order.
- *
- * there is no need to specify memcg here, since it is embedded in page_cgroup
- */
-static inline void
-memcg_kmem_uncharge_pages(struct page *page, int order)
-{
-	if (memcg_kmem_enabled())
-		__memcg_kmem_uncharge_pages(page, order);
-}
-
-/**
- * memcg_kmem_commit_charge: embeds correct memcg in a page
- * @page: pointer to struct page recently allocated
- * @memcg: the memcg structure we charged against
- * @order: allocation order.
- *
- * Needs to be called after memcg_kmem_newpage_charge, regardless of success or
- * failure of the allocation. if @page is NULL, this function will revert the
- * charges. Otherwise, it will commit the memcg given by @memcg to the
- * corresponding page_cgroup.
- */
-static inline void
-memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
-{
-	if (memcg_kmem_enabled() && memcg)
-		__memcg_kmem_commit_charge(page, memcg, order);
-}
-
-/**
  * memcg_kmem_get_cache: selects the correct per-memcg cache for allocation
  * @cachep: the original global kmem cache
  * @gfp: allocation flags.
@@ -627,21 +552,6 @@ static inline bool memcg_kmem_enabled(void)
 	return false;
 }
 
-static inline bool
-memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **memcg, int order)
-{
-	return true;
-}
-
-static inline void memcg_kmem_uncharge_pages(struct page *page, int order)
-{
-}
-
-static inline void
-memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg, int order)
-{
-}
-
 static inline int memcg_cache_id(struct mem_cgroup *memcg)
 {
 	return -1;
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 1eddbf1557f2..d6fd8e5b14b7 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -34,7 +34,6 @@
 	{(unsigned long)__GFP_HARDWALL,		"GFP_HARDWALL"},	\
 	{(unsigned long)__GFP_THISNODE,		"GFP_THISNODE"},	\
 	{(unsigned long)__GFP_RECLAIMABLE,	"GFP_RECLAIMABLE"},	\
-	{(unsigned long)__GFP_KMEMCG,		"GFP_KMEMCG"},		\
 	{(unsigned long)__GFP_MOVABLE,		"GFP_MOVABLE"},		\
 	{(unsigned long)__GFP_NOTRACK,		"GFP_NOTRACK"},		\
 	{(unsigned long)__GFP_NO_KSWAPD,	"GFP_NO_KSWAPD"},	\
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9b6f45607f4f..380292b88897 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3466,123 +3466,6 @@ out:
 	rcu_read_unlock();
 	return cachep;
 }
-EXPORT_SYMBOL(__memcg_kmem_get_cache);
-
-/*
- * We need to verify if the allocation against current->mm->owner's memcg is
- * possible for the given order. But the page is not allocated yet, so we'll
- * need a further commit step to do the final arrangements.
- *
- * It is possible for the task to switch cgroups in this mean time, so at
- * commit time, we can't rely on task conversion any longer.  We'll then use
- * the handle argument to return to the caller which cgroup we should commit
- * against. We could also return the memcg directly and avoid the pointer
- * passing, but a boolean return value gives better semantics considering
- * the compiled-out case as well.
- *
- * Returning true means the allocation is possible.
- */
-bool
-__memcg_kmem_newpage_charge(gfp_t gfp, struct mem_cgroup **_memcg, int order)
-{
-	struct mem_cgroup *memcg;
-	int ret;
-
-	*_memcg = NULL;
-
-	/*
-	 * Disabling accounting is only relevant for some specific memcg
-	 * internal allocations. Therefore we would initially not have such
-	 * check here, since direct calls to the page allocator that are marked
-	 * with GFP_KMEMCG only happen outside memcg core. We are mostly
-	 * concerned with cache allocations, and by having this test at
-	 * memcg_kmem_get_cache, we are already able to relay the allocation to
-	 * the root cache and bypass the memcg cache altogether.
-	 *
-	 * There is one exception, though: the SLUB allocator does not create
-	 * large order caches, but rather service large kmallocs directly from
-	 * the page allocator. Therefore, the following sequence when backed by
-	 * the SLUB allocator:
-	 *
-	 *	memcg_stop_kmem_account();
-	 *	kmalloc(<large_number>)
-	 *	memcg_resume_kmem_account();
-	 *
-	 * would effectively ignore the fact that we should skip accounting,
-	 * since it will drive us directly to this function without passing
-	 * through the cache selector memcg_kmem_get_cache. Such large
-	 * allocations are extremely rare but can happen, for instance, for the
-	 * cache arrays. We bring this test here.
-	 */
-	if (!current->mm || current->memcg_kmem_skip_account)
-		return true;
-
-	memcg = get_mem_cgroup_from_mm(current->mm);
-
-	if (!memcg_can_account_kmem(memcg)) {
-		css_put(&memcg->css);
-		return true;
-	}
-
-	ret = memcg_charge_kmem(memcg, gfp, PAGE_SIZE << order);
-	if (!ret)
-		*_memcg = memcg;
-
-	css_put(&memcg->css);
-	return (ret == 0);
-}
-
-void __memcg_kmem_commit_charge(struct page *page, struct mem_cgroup *memcg,
-			      int order)
-{
-	struct page_cgroup *pc;
-
-	VM_BUG_ON(mem_cgroup_is_root(memcg));
-
-	/* The page allocation failed. Revert */
-	if (!page) {
-		memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
-		return;
-	}
-
-	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
-	pc->mem_cgroup = memcg;
-	SetPageCgroupUsed(pc);
-	unlock_page_cgroup(pc);
-}
-
-void __memcg_kmem_uncharge_pages(struct page *page, int order)
-{
-	struct mem_cgroup *memcg = NULL;
-	struct page_cgroup *pc;
-
-
-	pc = lookup_page_cgroup(page);
-	/*
-	 * Fast unlocked return. Theoretically might have changed, have to
-	 * check again after locking.
-	 */
-	if (!PageCgroupUsed(pc))
-		return;
-
-	lock_page_cgroup(pc);
-	if (PageCgroupUsed(pc)) {
-		memcg = pc->mem_cgroup;
-		ClearPageCgroupUsed(pc);
-	}
-	unlock_page_cgroup(pc);
-
-	/*
-	 * We trust that only if there is a memcg associated with the page, it
-	 * is a valid allocation
-	 */
-	if (!memcg)
-		return;
-
-	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
-	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
-}
 
 static void __init memcg_kmem_init(void)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 12aa8c255d8d..68d2c1708bca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2743,7 +2743,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	int migratetype = allocflags_to_migratetype(gfp_mask);
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET;
-	struct mem_cgroup *memcg = NULL;
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2762,13 +2761,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	if (unlikely(!zonelist->_zonerefs->zone))
 		return NULL;
 
-	/*
-	 * Will only have any effect when __GFP_KMEMCG is set.  This is
-	 * verified in the (always inline) callee
-	 */
-	if (!memcg_kmem_newpage_charge(gfp_mask, &memcg, order))
-		return NULL;
-
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
 
@@ -2811,8 +2803,6 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
-	memcg_kmem_commit_charge(page, memcg, order);
-
 	if (page)
 		set_page_owner(page, order, gfp_mask);
 
@@ -2868,31 +2858,6 @@ void free_pages(unsigned long addr, unsigned int order)
 
 EXPORT_SYMBOL(free_pages);
 
-/*
- * __free_memcg_kmem_pages and free_memcg_kmem_pages will free
- * pages allocated with __GFP_KMEMCG.
- *
- * Those pages are accounted to a particular memcg, embedded in the
- * corresponding page_cgroup. To avoid adding a hit in the allocator to search
- * for that information only to find out that it is NULL for users who have no
- * interest in that whatsoever, we provide these functions.
- *
- * The caller knows better which flags it relies on.
- */
-void __free_memcg_kmem_pages(struct page *page, unsigned int order)
-{
-	memcg_kmem_uncharge_pages(page, order);
-	__free_pages(page, order);
-}
-
-void free_memcg_kmem_pages(unsigned long addr, unsigned int order)
-{
-	if (addr != 0) {
-		VM_BUG_ON(!virt_addr_valid((void *)addr));
-		__free_memcg_kmem_pages(virt_to_page((void *)addr), order);
-	}
-}
-
 static void *make_alloc_exact(unsigned long addr, unsigned order, size_t size)
 {
 	if (addr) {
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH RESEND -mm 11/12] memcg: reparent slab on css offline
  2014-03-13 15:06 ` Vladimir Davydov
@ 2014-03-13 15:06   ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

Currently we take a css reference for each memcg cache. This is simple,
but extremely ugly - a memcg will hang around after its death for as
long as it has any charges. Moreover, this clashes with the design
principles the cgroup subsystem implies.

However, there is nothing that prevents us from reparenting kmem charges
just like we do with user pages. Moreover, it is much easier to
implement: we already keep all memcg caches on a list, so all we have to
do is walk over the list, move the caches to the parent cgroup's list,
and change the memcg pointer in the meanwhile. If somebody frees an
object to a cache that is being reparented, they might still see a
pointer to the old memcg, but that's OK: we only need RCU to protect
against use-after-free. Let's just do it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 mm/memcontrol.c |  306 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 189 insertions(+), 117 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 380292b88897..77079ff6bb65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -383,11 +383,20 @@ struct mem_cgroup {
 /* internal only representation about the status of kmem accounting. */
 enum {
 	KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
-	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
+	KMEM_ACCOUNTED_REPARENTED, /* has reparented caches */
 };
 
 #ifdef CONFIG_MEMCG_KMEM
-static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
+/*
+ * mem_cgroup::slab_caches_mutex nesting subclasses:
+ */
+enum memcg_slab_mutex_class
+{
+	MEMCG_SLAB_MUTEX_PARENT,
+	MEMCG_SLAB_MUTEX_CHILD,
+};
+
+static void memcg_kmem_set_active(struct mem_cgroup *memcg)
 {
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -397,21 +406,14 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
+static void memcg_kmem_set_reparented(struct mem_cgroup *memcg)
 {
-	/*
-	 * Our caller must use css_get() first, because memcg_uncharge_kmem()
-	 * will call css_put() if it sees the memcg is dead.
-	 */
-	smp_wmb();
-	if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
-		set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
+	set_bit(KMEM_ACCOUNTED_REPARENTED, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
+static bool memcg_kmem_is_reparented(struct mem_cgroup *memcg)
 {
-	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
-				  &memcg->kmem_account_flags);
+	return test_bit(KMEM_ACCOUNTED_REPARENTED, &memcg->kmem_account_flags);
 }
 #endif
 
@@ -656,11 +658,6 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 		static_key_slow_dec(&memcg_kmem_enabled_key);
 		ida_simple_remove(&kmem_limited_groups, memcg->kmemcg_id);
 	}
-	/*
-	 * This check can't live in kmem destruction function,
-	 * since the charges will outlive the cgroup
-	 */
-	WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
 }
 #else
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -2975,36 +2972,48 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 	res_counter_uncharge(&memcg->res, size);
 	if (do_swap_account)
 		res_counter_uncharge(&memcg->memsw, size);
+	res_counter_uncharge(&memcg->kmem, size);
+}
 
-	/* Not down to 0 */
-	if (res_counter_uncharge(&memcg->kmem, size))
-		return;
+static struct mem_cgroup *try_get_mem_cgroup_from_slab(struct kmem_cache *s)
+{
+	struct mem_cgroup *memcg;
 
-	/*
-	 * Releases a reference taken in kmem_cgroup_css_offline in case
-	 * this last uncharge is racing with the offlining code or it is
-	 * outliving the memcg existence.
-	 *
-	 * The memory barrier imposed by test&clear is paired with the
-	 * explicit one in memcg_kmem_mark_dead().
-	 */
-	if (memcg_kmem_test_and_clear_dead(memcg))
-		css_put(&memcg->css);
+	if (is_root_cache(s))
+		return NULL;
+
+	rcu_read_lock();
+	do {
+		memcg = s->memcg_params->memcg;
+		if (!memcg)
+			break;
+	} while (!css_tryget(&memcg->css));
+	rcu_read_unlock();
+	return memcg;
 }
 
 int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages)
 {
-	if (is_root_cache(s))
-		return 0;
-	return memcg_charge_kmem(s->memcg_params->memcg,
-				 gfp, nr_pages << PAGE_SHIFT);
+	struct mem_cgroup *memcg;
+	int ret = 0;
+
+	memcg = try_get_mem_cgroup_from_slab(s);
+	if (memcg) {
+		ret = memcg_charge_kmem(memcg, gfp, nr_pages << PAGE_SHIFT);
+		css_put(&memcg->css);
+	}
+	return ret;
 }
 
 void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages)
 {
-	if (is_root_cache(s))
-		return;
-	memcg_uncharge_kmem(s->memcg_params->memcg, nr_pages << PAGE_SHIFT);
+	struct mem_cgroup *memcg;
+
+	memcg = try_get_mem_cgroup_from_slab(s);
+	if (memcg) {
+		memcg_uncharge_kmem(memcg, nr_pages << PAGE_SHIFT);
+		css_put(&memcg->css);
+	}
 }
 
 /*
@@ -3149,7 +3158,6 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 		INIT_WORK(&s->memcg_params->destroy,
 				kmem_cache_destroy_work_func);
 		atomic_set(&s->memcg_params->refcount, 1);
-		css_get(&memcg->css);
 	} else {
 		s->memcg_params->is_root_cache = true;
 		INIT_LIST_HEAD(&s->memcg_params->children);
@@ -3160,19 +3168,36 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 
 void memcg_free_cache_params(struct kmem_cache *s)
 {
-	if (!s->memcg_params)
-		return;
-	if (!s->memcg_params->is_root_cache)
-		css_put(&s->memcg_params->memcg->css);
 	kfree(s->memcg_params);
 }
 
-void memcg_register_cache(struct kmem_cache *s)
+static void __memcg_register_cache(struct kmem_cache *s, struct kmem_cache *root)
 {
-	struct kmem_cache *root;
 	struct mem_cgroup *memcg;
 	int id;
 
+	memcg = s->memcg_params->memcg;
+	/*
+	 * Special case: re-registering the cache on __kmem_cache_shutdown()
+	 * failure (see __kmem_cache_destroy()).
+	 */
+	if (!memcg)
+		return;
+
+	id = memcg_cache_id(memcg);
+	BUG_ON(id < 0);
+
+	mutex_lock(&memcg->slab_caches_mutex);
+	BUG_ON(root->memcg_params->memcg_caches[id]);
+	root->memcg_params->memcg_caches[id] = s;
+	list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);
+	mutex_unlock(&memcg->slab_caches_mutex);
+}
+
+void memcg_register_cache(struct kmem_cache *s)
+{
+	struct kmem_cache *root;
+
 	if (is_root_cache(s))
 		return;
 
@@ -3182,10 +3207,6 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	lockdep_assert_held(&slab_mutex);
 
-	root = s->memcg_params->root_cache;
-	memcg = s->memcg_params->memcg;
-	id = memcg_cache_id(memcg);
-
 	/*
 	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
 	 * barrier here to ensure nobody will see the kmem_cache partially
@@ -3193,21 +3214,61 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	smp_wmb();
 
+	root = s->memcg_params->root_cache;
 	list_add(&s->memcg_params->siblings, &root->memcg_params->children);
+	__memcg_register_cache(s, root);
 
-	VM_BUG_ON(root->memcg_params->memcg_caches[id]);
-	root->memcg_params->memcg_caches[id] = s;
+	static_key_slow_inc(&memcg_kmem_enabled_key);
+}
+
+static void __memcg_unregister_cache(struct kmem_cache *s, struct kmem_cache *root)
+{
+	struct mem_cgroup *memcg;
+	int id;
+
+retry:
+	memcg = try_get_mem_cgroup_from_slab(s);
+
+	/*
+	 * This can happen if the cache's memcg was turned offline and it was
+	 * reparented to the root cgroup. In this case the cache must have
+	 * already been properly unregistered so we have nothing to do.
+	 */
+	if (!memcg)
+		return;
 
+	id = memcg_cache_id(memcg);
+
+	/*
+	 * To delete a cache from memcg_slab_caches list, we need to take the
+	 * corresponding slab_caches_mutex. Since nothing prevents the cache
+	 * from being reparented while we are here, we recheck the cache's
+	 * memcg after taking the mutex and retry if it changed.
+	 */
 	mutex_lock(&memcg->slab_caches_mutex);
-	list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);
+	if (memcg != s->memcg_params->memcg) {
+		mutex_unlock(&memcg->slab_caches_mutex);
+		css_put(&memcg->css);
+		goto retry;
+	}
+
+	list_del(&s->memcg_params->list);
+	s->memcg_params->memcg = NULL;
+
+	/*
+	 * Clear the slot in the memcg_caches array only if the cache hasn't
+	 * been reparented before.
+	 */
+	if (id >= 0 && root->memcg_params->memcg_caches[id] == s)
+		root->memcg_params->memcg_caches[id] = NULL;
+
 	mutex_unlock(&memcg->slab_caches_mutex);
+	css_put(&memcg->css);
 }
 
 void memcg_unregister_cache(struct kmem_cache *s)
 {
 	struct kmem_cache *root;
-	struct mem_cgroup *memcg;
-	int id;
 
 	if (is_root_cache(s))
 		return;
@@ -3219,17 +3280,10 @@ void memcg_unregister_cache(struct kmem_cache *s)
 	lockdep_assert_held(&slab_mutex);
 
 	root = s->memcg_params->root_cache;
-	memcg = s->memcg_params->memcg;
-	id = memcg_cache_id(memcg);
-
 	list_del(&s->memcg_params->siblings);
+	__memcg_unregister_cache(s, root);
 
-	mutex_lock(&memcg->slab_caches_mutex);
-	list_del(&s->memcg_params->list);
-	mutex_unlock(&memcg->slab_caches_mutex);
-
-	VM_BUG_ON(root->memcg_params->memcg_caches[id] != s);
-	root->memcg_params->memcg_caches[id] = NULL;
+	static_key_slow_dec(&memcg_kmem_enabled_key);
 }
 
 /*
@@ -3273,7 +3327,7 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 
 	if (atomic_read(&params->refcount) != 0) {
 		/*
-		 * We were scheduled from mem_cgroup_destroy_all_caches().
+		 * We were scheduled from mem_cgroup_reparent_slab().
 		 * Shrink the cache and drop the reference taken by memcg.
 		 */
 		kmem_cache_shrink(cachep);
@@ -3316,11 +3370,56 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	mutex_unlock(&activate_kmem_mutex);
 }
 
-static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_one_slab(struct kmem_cache *cachep,
+		struct mem_cgroup *from, struct mem_cgroup *to)
 {
+	struct kmem_cache *root;
+	int id;
+
+	root = cachep->memcg_params->root_cache;
+
+	if (to)
+		list_move(&cachep->memcg_params->list, &to->memcg_slab_caches);
+	else
+		list_del(&cachep->memcg_params->list);
+
+	BUG_ON(cachep->memcg_params->memcg != from);
+	cachep->memcg_params->memcg = to;
+
+	/*
+	 * We may access the cachep->memcg_params->memcg ptr lock-free so we
+	 * have to make sure readers will see the new value before the final
+	 * css put.
+	 */
+	smp_wmb();
+
+	/*
+	 * If the cache has already been reparented we are done here. Otherwise
+	 * we clear the reference to it in the memcg_caches array and schedule
+	 * shrink work.
+	 */
+	id = memcg_cache_id(from);
+	if (id < 0 || root->memcg_params->memcg_caches[id] != cachep)
+		return;
+
+	root->memcg_params->memcg_caches[id] = NULL;
+
+	/*
+	 * The work could not be scheduled from memcg_release_pages(), because
+	 * we haven't dropped cachep->memcg_params->refcount yet. That's why we
+	 * cannot fail here.
+	 */
+	if (!schedule_work(&cachep->memcg_params->destroy))
+		BUG();
+}
+
+static void mem_cgroup_reparent_slab(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent;
 	struct memcg_cache_params *params;
 
-	if (!memcg_kmem_is_active(memcg))
+	if (!memcg_kmem_is_active(memcg) &&
+	    !memcg_kmem_is_reparented(memcg))
 		return;
 
 	/*
@@ -3332,17 +3431,30 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 	 */
 	flush_workqueue(memcg_cache_create_wq);
 
-	mutex_lock(&memcg->slab_caches_mutex);
-	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
-		/*
-		 * Since we still hold the reference to the cache params from
-		 * the memcg, the work could not have been scheduled from
-		 * memcg_release_pages(), and this cannot fail.
-		 */
-		if (!schedule_work(&params->destroy))
-			BUG();
+	/*
+	 * We are going to modify memcg_caches arrays, so we have to protect
+	 * them against relocating.
+	 */
+	mutex_lock(&activate_kmem_mutex);
+
+	parent = parent_mem_cgroup(memcg);
+	if (parent)
+		mutex_lock_nested(&parent->slab_caches_mutex,
+				  MEMCG_SLAB_MUTEX_PARENT);
+	mutex_lock_nested(&memcg->slab_caches_mutex, MEMCG_SLAB_MUTEX_CHILD);
+	while (!list_empty(&memcg->memcg_slab_caches)) {
+		params = list_first_entry(&memcg->memcg_slab_caches,
+					  struct memcg_cache_params, list);
+		mem_cgroup_reparent_one_slab(params->cachep, memcg, parent);
 	}
 	mutex_unlock(&memcg->slab_caches_mutex);
+	if (parent)
+		mutex_unlock(&parent->slab_caches_mutex);
+
+	mutex_unlock(&activate_kmem_mutex);
+
+	if (parent)
+		memcg_kmem_set_reparented(parent);
 }
 
 struct create_work {
@@ -3473,7 +3585,7 @@ static void __init memcg_kmem_init(void)
 	BUG_ON(!memcg_cache_create_wq);
 }
 #else
-static inline void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_slab(struct mem_cgroup *memcg)
 {
 }
 
@@ -5681,40 +5793,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 	mem_cgroup_sockets_destroy(memcg);
 }
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-	if (!memcg_kmem_is_active(memcg))
-		return;
-
-	/*
-	 * kmem charges can outlive the cgroup. In the case of slab
-	 * pages, for instance, a page contain objects from various
-	 * processes. As we prevent from taking a reference for every
-	 * such allocation we have to be careful when doing uncharge
-	 * (see memcg_uncharge_kmem) and here during offlining.
-	 *
-	 * The idea is that that only the _last_ uncharge which sees
-	 * the dead memcg will drop the last reference. An additional
-	 * reference is taken here before the group is marked dead
-	 * which is then paired with css_put during uncharge resp. here.
-	 *
-	 * Although this might sound strange as this path is called from
-	 * css_offline() when the referencemight have dropped down to 0
-	 * and shouldn't be incremented anymore (css_tryget would fail)
-	 * we do not have other options because of the kmem allocations
-	 * lifetime.
-	 */
-	css_get(&memcg->css);
-
-	memcg_kmem_mark_dead(memcg);
-
-	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
-		return;
-
-	if (memcg_kmem_test_and_clear_dead(memcg))
-		css_put(&memcg->css);
-}
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
@@ -5724,10 +5802,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 }
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-}
 #endif
 
 /*
@@ -6335,8 +6409,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	}
 	spin_unlock(&memcg->event_list_lock);
 
-	kmem_cgroup_css_offline(memcg);
-
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 
 	/*
@@ -6346,7 +6418,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	css_for_each_descendant_post(iter, css)
 		mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
 
-	mem_cgroup_destroy_all_caches(memcg);
+	mem_cgroup_reparent_slab(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH RESEND -mm 11/12] memcg: reparent slab on css offline
@ 2014-03-13 15:06   ` Vladimir Davydov
  0 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm; +Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel

Currently we take a css reference for each memcg cache. This is simple,
but extremely ugly - a memcg will hang around after its death for as
long as it has any charges. Moreover, this clashes with the design
principles the cgroup subsystem implies.

However, there is nothing that prevents us from reparenting kmem charges
just like we do with user pages. Moreover, it is much easier to
implement: we already keep all memcg caches on a list, so all we have to
do is walk over the list, move the caches to the parent cgroup's list,
and change the memcg pointer in the meanwhile. If somebody frees an
object to a cache that is being reparented, they might still see a
pointer to the old memcg, but that's OK: we only need RCU to protect
against use-after-free. Let's just do it.

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
---
 mm/memcontrol.c |  306 ++++++++++++++++++++++++++++++++++---------------------
 1 file changed, 189 insertions(+), 117 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 380292b88897..77079ff6bb65 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -383,11 +383,20 @@ struct mem_cgroup {
 /* internal only representation about the status of kmem accounting. */
 enum {
 	KMEM_ACCOUNTED_ACTIVE, /* accounted by this cgroup itself */
-	KMEM_ACCOUNTED_DEAD, /* dead memcg with pending kmem charges */
+	KMEM_ACCOUNTED_REPARENTED, /* has reparented caches */
 };
 
 #ifdef CONFIG_MEMCG_KMEM
-static inline void memcg_kmem_set_active(struct mem_cgroup *memcg)
+/*
+ * mem_cgroup::slab_caches_mutex nesting subclasses:
+ */
+enum memcg_slab_mutex_class
+{
+	MEMCG_SLAB_MUTEX_PARENT,
+	MEMCG_SLAB_MUTEX_CHILD,
+};
+
+static void memcg_kmem_set_active(struct mem_cgroup *memcg)
 {
 	set_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
@@ -397,21 +406,14 @@ static bool memcg_kmem_is_active(struct mem_cgroup *memcg)
 	return test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags);
 }
 
-static void memcg_kmem_mark_dead(struct mem_cgroup *memcg)
+static void memcg_kmem_set_reparented(struct mem_cgroup *memcg)
 {
-	/*
-	 * Our caller must use css_get() first, because memcg_uncharge_kmem()
-	 * will call css_put() if it sees the memcg is dead.
-	 */
-	smp_wmb();
-	if (test_bit(KMEM_ACCOUNTED_ACTIVE, &memcg->kmem_account_flags))
-		set_bit(KMEM_ACCOUNTED_DEAD, &memcg->kmem_account_flags);
+	set_bit(KMEM_ACCOUNTED_REPARENTED, &memcg->kmem_account_flags);
 }
 
-static bool memcg_kmem_test_and_clear_dead(struct mem_cgroup *memcg)
+static bool memcg_kmem_is_reparented(struct mem_cgroup *memcg)
 {
-	return test_and_clear_bit(KMEM_ACCOUNTED_DEAD,
-				  &memcg->kmem_account_flags);
+	return test_bit(KMEM_ACCOUNTED_REPARENTED, &memcg->kmem_account_flags);
 }
 #endif
 
@@ -656,11 +658,6 @@ static void disarm_kmem_keys(struct mem_cgroup *memcg)
 		static_key_slow_dec(&memcg_kmem_enabled_key);
 		ida_simple_remove(&kmem_limited_groups, memcg->kmemcg_id);
 	}
-	/*
-	 * This check can't live in kmem destruction function,
-	 * since the charges will outlive the cgroup
-	 */
-	WARN_ON(res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0);
 }
 #else
 static void disarm_kmem_keys(struct mem_cgroup *memcg)
@@ -2975,36 +2972,48 @@ static void memcg_uncharge_kmem(struct mem_cgroup *memcg, u64 size)
 	res_counter_uncharge(&memcg->res, size);
 	if (do_swap_account)
 		res_counter_uncharge(&memcg->memsw, size);
+	res_counter_uncharge(&memcg->kmem, size);
+}
 
-	/* Not down to 0 */
-	if (res_counter_uncharge(&memcg->kmem, size))
-		return;
+static struct mem_cgroup *try_get_mem_cgroup_from_slab(struct kmem_cache *s)
+{
+	struct mem_cgroup *memcg;
 
-	/*
-	 * Releases a reference taken in kmem_cgroup_css_offline in case
-	 * this last uncharge is racing with the offlining code or it is
-	 * outliving the memcg existence.
-	 *
-	 * The memory barrier imposed by test&clear is paired with the
-	 * explicit one in memcg_kmem_mark_dead().
-	 */
-	if (memcg_kmem_test_and_clear_dead(memcg))
-		css_put(&memcg->css);
+	if (is_root_cache(s))
+		return NULL;
+
+	rcu_read_lock();
+	do {
+		memcg = s->memcg_params->memcg;
+		if (!memcg)
+			break;
+	} while (!css_tryget(&memcg->css));
+	rcu_read_unlock();
+	return memcg;
 }
 
 int __memcg_kmem_charge_slab(struct kmem_cache *s, gfp_t gfp, int nr_pages)
 {
-	if (is_root_cache(s))
-		return 0;
-	return memcg_charge_kmem(s->memcg_params->memcg,
-				 gfp, nr_pages << PAGE_SHIFT);
+	struct mem_cgroup *memcg;
+	int ret = 0;
+
+	memcg = try_get_mem_cgroup_from_slab(s);
+	if (memcg) {
+		ret = memcg_charge_kmem(memcg, gfp, nr_pages << PAGE_SHIFT);
+		css_put(&memcg->css);
+	}
+	return ret;
 }
 
 void __memcg_kmem_uncharge_slab(struct kmem_cache *s, int nr_pages)
 {
-	if (is_root_cache(s))
-		return;
-	memcg_uncharge_kmem(s->memcg_params->memcg, nr_pages << PAGE_SHIFT);
+	struct mem_cgroup *memcg;
+
+	memcg = try_get_mem_cgroup_from_slab(s);
+	if (memcg) {
+		memcg_uncharge_kmem(memcg, nr_pages << PAGE_SHIFT);
+		css_put(&memcg->css);
+	}
 }
 
 /*
@@ -3149,7 +3158,6 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 		INIT_WORK(&s->memcg_params->destroy,
 				kmem_cache_destroy_work_func);
 		atomic_set(&s->memcg_params->refcount, 1);
-		css_get(&memcg->css);
 	} else {
 		s->memcg_params->is_root_cache = true;
 		INIT_LIST_HEAD(&s->memcg_params->children);
@@ -3160,19 +3168,36 @@ int memcg_alloc_cache_params(struct mem_cgroup *memcg, struct kmem_cache *s,
 
 void memcg_free_cache_params(struct kmem_cache *s)
 {
-	if (!s->memcg_params)
-		return;
-	if (!s->memcg_params->is_root_cache)
-		css_put(&s->memcg_params->memcg->css);
 	kfree(s->memcg_params);
 }
 
-void memcg_register_cache(struct kmem_cache *s)
+static void __memcg_register_cache(struct kmem_cache *s, struct kmem_cache *root)
 {
-	struct kmem_cache *root;
 	struct mem_cgroup *memcg;
 	int id;
 
+	memcg = s->memcg_params->memcg;
+	/*
+	 * Special case: re-registering the cache on __kmem_cache_shutdown()
+	 * failure (see __kmem_cache_destroy()).
+	 */
+	if (!memcg)
+		return;
+
+	id = memcg_cache_id(memcg);
+	BUG_ON(id < 0);
+
+	mutex_lock(&memcg->slab_caches_mutex);
+	BUG_ON(root->memcg_params->memcg_caches[id]);
+	root->memcg_params->memcg_caches[id] = s;
+	list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);
+	mutex_unlock(&memcg->slab_caches_mutex);
+}
+
+void memcg_register_cache(struct kmem_cache *s)
+{
+	struct kmem_cache *root;
+
 	if (is_root_cache(s))
 		return;
 
@@ -3182,10 +3207,6 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	lockdep_assert_held(&slab_mutex);
 
-	root = s->memcg_params->root_cache;
-	memcg = s->memcg_params->memcg;
-	id = memcg_cache_id(memcg);
-
 	/*
 	 * Since readers won't lock (see cache_from_memcg_idx()), we need a
 	 * barrier here to ensure nobody will see the kmem_cache partially
@@ -3193,21 +3214,61 @@ void memcg_register_cache(struct kmem_cache *s)
 	 */
 	smp_wmb();
 
+	root = s->memcg_params->root_cache;
 	list_add(&s->memcg_params->siblings, &root->memcg_params->children);
+	__memcg_register_cache(s, root);
 
-	VM_BUG_ON(root->memcg_params->memcg_caches[id]);
-	root->memcg_params->memcg_caches[id] = s;
+	static_key_slow_inc(&memcg_kmem_enabled_key);
+}
+
+static void __memcg_unregister_cache(struct kmem_cache *s, struct kmem_cache *root)
+{
+	struct mem_cgroup *memcg;
+	int id;
+
+retry:
+	memcg = try_get_mem_cgroup_from_slab(s);
+
+	/*
+	 * This can happen if the cache's memcg was turned offline and it was
+	 * reparented to the root cgroup. In this case the cache must have
+	 * already been properly unregistered so we have nothing to do.
+	 */
+	if (!memcg)
+		return;
 
+	id = memcg_cache_id(memcg);
+
+	/*
+	 * To delete a cache from memcg_slab_caches list, we need to take the
+	 * corresponding slab_caches_mutex. Since nothing prevents the cache
+	 * from being reparented while we are here, we recheck the cache's
+	 * memcg after taking the mutex and retry if it changed.
+	 */
 	mutex_lock(&memcg->slab_caches_mutex);
-	list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);
+	if (memcg != s->memcg_params->memcg) {
+		mutex_unlock(&memcg->slab_caches_mutex);
+		css_put(&memcg->css);
+		goto retry;
+	}
+
+	list_del(&s->memcg_params->list);
+	s->memcg_params->memcg = NULL;
+
+	/*
+	 * Clear the slot in the memcg_caches array only if the cache hasn't
+	 * been reparented before.
+	 */
+	if (id >= 0 && root->memcg_params->memcg_caches[id] == s)
+		root->memcg_params->memcg_caches[id] = NULL;
+
 	mutex_unlock(&memcg->slab_caches_mutex);
+	css_put(&memcg->css);
 }
 
 void memcg_unregister_cache(struct kmem_cache *s)
 {
 	struct kmem_cache *root;
-	struct mem_cgroup *memcg;
-	int id;
 
 	if (is_root_cache(s))
 		return;
@@ -3219,17 +3280,10 @@ void memcg_unregister_cache(struct kmem_cache *s)
 	lockdep_assert_held(&slab_mutex);
 
 	root = s->memcg_params->root_cache;
-	memcg = s->memcg_params->memcg;
-	id = memcg_cache_id(memcg);
-
 	list_del(&s->memcg_params->siblings);
+	__memcg_unregister_cache(s, root);
 
-	mutex_lock(&memcg->slab_caches_mutex);
-	list_del(&s->memcg_params->list);
-	mutex_unlock(&memcg->slab_caches_mutex);
-
-	VM_BUG_ON(root->memcg_params->memcg_caches[id] != s);
-	root->memcg_params->memcg_caches[id] = NULL;
+	static_key_slow_dec(&memcg_kmem_enabled_key);
 }
 
 /*
@@ -3273,7 +3327,7 @@ static void kmem_cache_destroy_work_func(struct work_struct *w)
 
 	if (atomic_read(&params->refcount) != 0) {
 		/*
-		 * We were scheduled from mem_cgroup_destroy_all_caches().
+		 * We were scheduled from mem_cgroup_reparent_slab().
 		 * Shrink the cache and drop the reference taken by memcg.
 		 */
 		kmem_cache_shrink(cachep);
@@ -3316,11 +3370,56 @@ void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
 	mutex_unlock(&activate_kmem_mutex);
 }
 
-static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_one_slab(struct kmem_cache *cachep,
+		struct mem_cgroup *from, struct mem_cgroup *to)
 {
+	struct kmem_cache *root;
+	int id;
+
+	root = cachep->memcg_params->root_cache;
+
+	if (to)
+		list_move(&cachep->memcg_params->list, &to->memcg_slab_caches);
+	else
+		list_del(&cachep->memcg_params->list);
+
+	BUG_ON(cachep->memcg_params->memcg != from);
+	cachep->memcg_params->memcg = to;
+
+	/*
+	 * We may access the cachep->memcg_params->memcg ptr lock-free so we
+	 * have to make sure readers will see the new value before the final
+	 * css put.
+	 */
+	smp_wmb();
+
+	/*
+	 * If the cache has already been reparented we are done here. Otherwise
+	 * we clear the reference to it in the memcg_caches array and schedule
+	 * shrink work.
+	 */
+	id = memcg_cache_id(from);
+	if (id < 0 || root->memcg_params->memcg_caches[id] != cachep)
+		return;
+
+	root->memcg_params->memcg_caches[id] = NULL;
+
+	/*
+	 * The work could not be scheduled from memcg_release_pages(), because
+	 * we haven't dropped cachep->memcg_params->refcount yet. That's why we
+	 * cannot fail here.
+	 */
+	if (!schedule_work(&cachep->memcg_params->destroy))
+		BUG();
+}
+
+static void mem_cgroup_reparent_slab(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent;
 	struct memcg_cache_params *params;
 
-	if (!memcg_kmem_is_active(memcg))
+	if (!memcg_kmem_is_active(memcg) &&
+	    !memcg_kmem_is_reparented(memcg))
 		return;
 
 	/*
@@ -3332,17 +3431,30 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
 	 */
 	flush_workqueue(memcg_cache_create_wq);
 
-	mutex_lock(&memcg->slab_caches_mutex);
-	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
-		/*
-		 * Since we still hold the reference to the cache params from
-		 * the memcg, the work could not have been scheduled from
-		 * memcg_release_pages(), and this cannot fail.
-		 */
-		if (!schedule_work(&params->destroy))
-			BUG();
+	/*
+	 * We are going to modify memcg_caches arrays, so we have to protect
+	 * them against being relocated.
+	 */
+	mutex_lock(&activate_kmem_mutex);
+
+	parent = parent_mem_cgroup(memcg);
+	if (parent)
+		mutex_lock_nested(&parent->slab_caches_mutex,
+				  MEMCG_SLAB_MUTEX_PARENT);
+	mutex_lock_nested(&memcg->slab_caches_mutex, MEMCG_SLAB_MUTEX_CHILD);
+	while (!list_empty(&memcg->memcg_slab_caches)) {
+		params = list_first_entry(&memcg->memcg_slab_caches,
+					  struct memcg_cache_params, list);
+		mem_cgroup_reparent_one_slab(params->cachep, memcg, parent);
 	}
 	mutex_unlock(&memcg->slab_caches_mutex);
+	if (parent)
+		mutex_unlock(&parent->slab_caches_mutex);
+
+	mutex_unlock(&activate_kmem_mutex);
+
+	if (parent)
+		memcg_kmem_set_reparented(parent);
 }
 
 struct create_work {
@@ -3473,7 +3585,7 @@ static void __init memcg_kmem_init(void)
 	BUG_ON(!memcg_cache_create_wq);
 }
 #else
-static inline void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
+static void mem_cgroup_reparent_slab(struct mem_cgroup *memcg)
 {
 }
 
@@ -5681,40 +5793,6 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 	mem_cgroup_sockets_destroy(memcg);
 }
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-	if (!memcg_kmem_is_active(memcg))
-		return;
-
-	/*
-	 * kmem charges can outlive the cgroup. In the case of slab
-	 * pages, for instance, a page contain objects from various
-	 * processes. As we prevent from taking a reference for every
-	 * such allocation we have to be careful when doing uncharge
-	 * (see memcg_uncharge_kmem) and here during offlining.
-	 *
-	 * The idea is that that only the _last_ uncharge which sees
-	 * the dead memcg will drop the last reference. An additional
-	 * reference is taken here before the group is marked dead
-	 * which is then paired with css_put during uncharge resp. here.
-	 *
-	 * Although this might sound strange as this path is called from
-	 * css_offline() when the referencemight have dropped down to 0
-	 * and shouldn't be incremented anymore (css_tryget would fail)
-	 * we do not have other options because of the kmem allocations
-	 * lifetime.
-	 */
-	css_get(&memcg->css);
-
-	memcg_kmem_mark_dead(memcg);
-
-	if (res_counter_read_u64(&memcg->kmem, RES_USAGE) != 0)
-		return;
-
-	if (memcg_kmem_test_and_clear_dead(memcg))
-		css_put(&memcg->css);
-}
 #else
 static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 {
@@ -5724,10 +5802,6 @@ static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
 static void memcg_destroy_kmem(struct mem_cgroup *memcg)
 {
 }
-
-static void kmem_cgroup_css_offline(struct mem_cgroup *memcg)
-{
-}
 #endif
 
 /*
@@ -6335,8 +6409,6 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	}
 	spin_unlock(&memcg->event_list_lock);
 
-	kmem_cgroup_css_offline(memcg);
-
 	mem_cgroup_invalidate_reclaim_iterators(memcg);
 
 	/*
@@ -6346,7 +6418,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	css_for_each_descendant_post(iter, css)
 		mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
 
-	mem_cgroup_destroy_all_caches(memcg);
+	mem_cgroup_reparent_slab(memcg);
 	vmpressure_cleanup(&memcg->vmpressure);
 }
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 42+ messages in thread

* [PATCH RESEND -mm 12/12] slub: make sure all memcg caches have unique names on sysfs
  2014-03-13 15:06 ` Vladimir Davydov
@ 2014-03-13 15:06   ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-13 15:06 UTC (permalink / raw)
  To: akpm
  Cc: hannes, mhocko, glommer, linux-kernel, linux-mm, devel,
	Christoph Lameter, Pekka Enberg

Since memcg caches are now reparented on memcg offline, a memcg cache
can outlive its cgroup. If the memcg id is then reused for a new cgroup
with the same name, we can get a cache name collision, which will result
in failures when trying to add a sysfs entry for the new cgroup's cache.
Let's fix this by appending the cache address to the sysfs names of all
memcg caches so that they are guaranteed to have unique names.
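
For illustration (the memcg cache name and the pointer value below are
made up; only the "-%p" suffix comes from this patch), a memcg copy of the
dentry cache would then get a sysfs name like

  dentry(2:mycg)-ffff88003768a000

instead of just its plain cache name.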

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Glauber Costa <glommer@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@kernel.org>
---
 mm/slub.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/mm/slub.c b/mm/slub.c
index 3ea91fb54f41..f5c74daeb46d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -5236,7 +5236,18 @@ static int sysfs_slab_add(struct kmem_cache *s)
 	}
 
 	s->kobj.kset = cache_kset(s);
-	err = kobject_init_and_add(&s->kobj, &slab_ktype, NULL, "%s", name);
+	/*
+	 * A memcg cache can outlive its cgroup. If the memcg id is then reused
+	 * for a new cgroup with the same name, we can get cache name
+	 * collision. To make sure all memcg caches have unique names on sysfs,
+	 * we append the cache address to its name.
+	 */
+	if (is_root_cache(s))
+		err = kobject_init_and_add(&s->kobj, &slab_ktype,
+					   NULL, "%s", name);
+	else
+		err = kobject_init_and_add(&s->kobj, &slab_ktype,
+					   NULL, "%s-%p", name, s);
 	if (err)
 		goto out_put_kobj;
 
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 42+ messages in thread

* Re: [PATCH RESEND -mm 01/12] memcg: flush cache creation works before memcg cache destruction
  2014-03-13 15:06   ` Vladimir Davydov
@ 2014-03-17 16:07     ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2014-03-17 16:07 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, glommer, linux-kernel, linux-mm, devel

On Thu 13-03-14 19:06:39, Vladimir Davydov wrote:
> When we get to memcg cache destruction, either from the root cache
> destruction path or when turning memcg offline, there still might be
> memcg cache creation works pending that was scheduled before we
> initiated destruction. We need to flush them before starting to destroy
> memcg caches, otherwise we can get a leaked kmem cache or, even worse,
> an attempt to use after free.

How can we get a use-after-free? Even if there is a pending work item to
create a new cache, we keep the css reference for the memcg and
release it from the worker (memcg_create_cache_work_func). So although
this can race with memcg offlining, the memcg itself will still be alive.

> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Glauber Costa <glommer@gmail.com>
> ---
>  mm/memcontrol.c |   32 +++++++++++++++++++++++++++++++-
>  1 file changed, 31 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 9d489a9e7701..b183aaf1b616 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2904,6 +2904,7 @@ static DEFINE_MUTEX(set_limit_mutex);
>  
>  #ifdef CONFIG_MEMCG_KMEM
>  static DEFINE_MUTEX(activate_kmem_mutex);
> +static struct workqueue_struct *memcg_cache_create_wq;
>  
>  static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
>  {
> @@ -3327,6 +3328,15 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
>  	int i, failed = 0;
>  
>  	/*
> +	 * Since the cache is being destroyed, it shouldn't be allocated from
> +	 * any more, and therefore no new memcg cache creation works could be
> +	 * scheduled. However, there still might be pending works scheduled
> +	 * before the cache destruction was initiated. Flush them before
> +	 * destroying child caches to avoid nasty races.
> +	 */
> +	flush_workqueue(memcg_cache_create_wq);
> +
> +	/*
>  	 * If the cache is being destroyed, we trust that there is no one else
>  	 * requesting objects from it. Even if there are, the sanity checks in
>  	 * kmem_cache_destroy should caught this ill-case.
> @@ -3374,6 +3384,15 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
>  	if (!memcg_kmem_is_active(memcg))
>  		return;
>  
> +	/*
> +	 * By the time we get here, the cgroup must be empty. That said no new
> +	 * allocations can happen from its caches, and therefore no new memcg
> +	 * cache creation works can be scheduled. However, there still might be
> +	 * pending works scheduled before the cgroup was turned offline. Flush
> +	 * them before destroying memcg caches to avoid nasty races.
> +	 */
> +	flush_workqueue(memcg_cache_create_wq);
> +
>  	mutex_lock(&memcg->slab_caches_mutex);
>  	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
>  		cachep = memcg_params_to_cache(params);
> @@ -3418,7 +3437,7 @@ static void __memcg_create_cache_enqueue(struct mem_cgroup *memcg,
>  	cw->cachep = cachep;
>  
>  	INIT_WORK(&cw->work, memcg_create_cache_work_func);
> -	schedule_work(&cw->work);
> +	queue_work(memcg_cache_create_wq, &cw->work);
>  }
>  
>  static void memcg_create_cache_enqueue(struct mem_cgroup *memcg,
> @@ -3621,10 +3640,20 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
>  	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
>  	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
>  }
> +
> +static void __init memcg_kmem_init(void)
> +{
> +	memcg_cache_create_wq = alloc_workqueue("memcg_cache_create", 0, 1);
> +	BUG_ON(!memcg_cache_create_wq);
> +}
>  #else
>  static inline void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
>  {
>  }
> +
> +static void __init memcg_kmem_init(void)
> +{
> +}
>  #endif /* CONFIG_MEMCG_KMEM */
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> @@ -7181,6 +7210,7 @@ static int __init mem_cgroup_init(void)
>  	enable_swap_cgroup();
>  	mem_cgroup_soft_limit_tree_init();
>  	memcg_stock_init();
> +	memcg_kmem_init();
>  	return 0;
>  }
>  subsys_initcall(mem_cgroup_init);
> -- 
> 1.7.10.4
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RESEND -mm 02/12] memcg: fix race in memcg cache destruction path
  2014-03-13 15:06   ` Vladimir Davydov
@ 2014-03-17 16:42     ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2014-03-17 16:42 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, glommer, linux-kernel, linux-mm, devel

On Thu 13-03-14 19:06:40, Vladimir Davydov wrote:
> We schedule memcg cache shrink+destruction work (memcg_params::destroy)
> from two places: when we turn memcg offline
> (mem_cgroup_destroy_all_caches) and when the last page of the cache is
> freed (memcg_params::nr_pages reachs zero, see memcg_release_pages,
> mem_cgroup_destroy_cache).

This is just ugly! Why do we call mem_cgroup_destroy_all_caches() from the
offline code at all? Just calling kmem_cache_shrink() and then waiting for
the last pages to go away should be sufficient to fix this, no?

Whether the current code is good (no, it's not) is another question. But
this should also be fixed in the stable trees (has the bug been there since
the very beginning?), so the fix should be as simple as possible IMO.
So if there is a simpler solution, I would prefer it. But I am drowning
in the kmem trickiness spread all over the place, so I might easily be
missing something.

> Since the latter can happen while the work
> scheduled from mem_cgroup_destroy_all_caches is in progress or still
> pending, we need to be cautious to avoid races there - we should
> accurately bail out in one of those functions if we see that the other
> is in progress. Currently we only check if memcg_params::nr_pages is 0
> in the destruction work handler and do not destroy the cache if so. But
> that's not enough. An example of race we can get is shown below:
> 
>   CPU0					CPU1
>   ----					----
>   kmem_cache_destroy_work_func:		memcg_release_pages:
> 					  atomic_sub_and_test(1<<order, &s->
> 							memcg_params->nr_pages)
> 					  /* reached 0 => schedule destroy */
> 
>     atomic_read(&cachep->memcg_params->nr_pages)
>     /* 0 => going to destroy the cache */
>     kmem_cache_destroy(cachep);
> 
> 					  mem_cgroup_destroy_cache(s):
> 					    /* the cache was destroyed on CPU0
> 					       - use after free */
> 
> An obvious way to fix this would be substituting the nr_pages counter
> with a reference counter and make memcg take a reference. The cache
> destruction would be then scheduled from that thread which decremented
> the refcount to 0. Generally, this is what this patch does, but there is
> one subtle thing here - the work handler serves not only for cache
> destruction, it also shrinks the cache if it's still in use (we can't
> call shrink directly from mem_cgroup_destroy_all_caches due to locking
> dependencies). We handle this by noting that we should only issue shrink
> if called from mem_cgroup_destroy_all_caches, because the cache is
> already empty when we release its last page. And if we drop the
> reference taken by memcg in the work handler, we can detect who exactly
> scheduled the worker - mem_cgroup_destroy_all_caches or
> memcg_release_pages.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.cz>
> Cc: Glauber Costa <glommer@gmail.com>
[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RESEND -mm 01/12] memcg: flush cache creation works before memcg cache destruction
  2014-03-17 16:07     ` Michal Hocko
@ 2014-03-18  8:14       ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-18  8:14 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, hannes, glommer, linux-kernel, linux-mm, devel

On 03/17/2014 08:07 PM, Michal Hocko wrote:
> On Thu 13-03-14 19:06:39, Vladimir Davydov wrote:
>> When we get to memcg cache destruction, either from the root cache
>> destruction path or when turning memcg offline, there still might be
>> memcg cache creation works pending that was scheduled before we
>> initiated destruction. We need to flush them before starting to destroy
>> memcg caches, otherwise we can get a leaked kmem cache or, even worse,
>> an attempt to use after free.
> How can we use-after-free? Even if there is a pending work item to
> create a new cache then we keep the css reference for the memcg and
> release it from the worker (memcg_create_cache_work_func). So although
> this can race with memcg offlining the memcg itself will be still alive.

There are actually two issues:

1) When we destroy a root cache using kmem_cache_destroy(), we should
ensure that all pending memcg cache creation works for this root cache
have completed; otherwise a work item could be executed after the root
cache is destroyed, resulting in a use-after-free (see the interleaving
sketched below).

2) Memcg offline. In this case a use-after-free is impossible in a memcg
cache creation work handler, because, as you mentioned, the work holds the
css reference. However, we still have to synchronize against pending
requests; otherwise a work handler could be executed after we have
destroyed the caches corresponding to the memcg being offlined, resulting
in a kmem_cache leak.
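
For issue 1, the problematic interleaving looks roughly like this (a
sketch, not an actual trace):

  CPU0                                  CPU1
  ----                                  ----
  kmem_cache_destroy(root cache):       memcg_create_cache_work_func:
    destroys the root cache and
    all its children
                                          creates a child for cw->cachep,
                                          which points to the already freed
                                          root cache - use-after-free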

Thanks.

>
>> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@suse.cz>
>> Cc: Glauber Costa <glommer@gmail.com>
>> ---
>>  mm/memcontrol.c |   32 +++++++++++++++++++++++++++++++-
>>  1 file changed, 31 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 9d489a9e7701..b183aaf1b616 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -2904,6 +2904,7 @@ static DEFINE_MUTEX(set_limit_mutex);
>>  
>>  #ifdef CONFIG_MEMCG_KMEM
>>  static DEFINE_MUTEX(activate_kmem_mutex);
>> +static struct workqueue_struct *memcg_cache_create_wq;
>>  
>>  static inline bool memcg_can_account_kmem(struct mem_cgroup *memcg)
>>  {
>> @@ -3327,6 +3328,15 @@ int __kmem_cache_destroy_memcg_children(struct kmem_cache *s)
>>  	int i, failed = 0;
>>  
>>  	/*
>> +	 * Since the cache is being destroyed, it shouldn't be allocated from
>> +	 * any more, and therefore no new memcg cache creation works could be
>> +	 * scheduled. However, there still might be pending works scheduled
>> +	 * before the cache destruction was initiated. Flush them before
>> +	 * destroying child caches to avoid nasty races.
>> +	 */
>> +	flush_workqueue(memcg_cache_create_wq);
>> +
>> +	/*
>>  	 * If the cache is being destroyed, we trust that there is no one else
>>  	 * requesting objects from it. Even if there are, the sanity checks in
>>  	 * kmem_cache_destroy should caught this ill-case.
>> @@ -3374,6 +3384,15 @@ static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
>>  	if (!memcg_kmem_is_active(memcg))
>>  		return;
>>  
>> +	/*
>> +	 * By the time we get here, the cgroup must be empty. That said no new
>> +	 * allocations can happen from its caches, and therefore no new memcg
>> +	 * cache creation works can be scheduled. However, there still might be
>> +	 * pending works scheduled before the cgroup was turned offline. Flush
>> +	 * them before destroying memcg caches to avoid nasty races.
>> +	 */
>> +	flush_workqueue(memcg_cache_create_wq);
>> +
>>  	mutex_lock(&memcg->slab_caches_mutex);
>>  	list_for_each_entry(params, &memcg->memcg_slab_caches, list) {
>>  		cachep = memcg_params_to_cache(params);
>> @@ -3418,7 +3437,7 @@ static void __memcg_create_cache_enqueue(struct mem_cgroup *memcg,
>>  	cw->cachep = cachep;
>>  
>>  	INIT_WORK(&cw->work, memcg_create_cache_work_func);
>> -	schedule_work(&cw->work);
>> +	queue_work(memcg_cache_create_wq, &cw->work);
>>  }
>>  
>>  static void memcg_create_cache_enqueue(struct mem_cgroup *memcg,
>> @@ -3621,10 +3640,20 @@ void __memcg_kmem_uncharge_pages(struct page *page, int order)
>>  	VM_BUG_ON_PAGE(mem_cgroup_is_root(memcg), page);
>>  	memcg_uncharge_kmem(memcg, PAGE_SIZE << order);
>>  }
>> +
>> +static void __init memcg_kmem_init(void)
>> +{
>> +	memcg_cache_create_wq = alloc_workqueue("memcg_cache_create", 0, 1);
>> +	BUG_ON(!memcg_cache_create_wq);
>> +}
>>  #else
>>  static inline void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
>>  {
>>  }
>> +
>> +static void __init memcg_kmem_init(void)
>> +{
>> +}
>>  #endif /* CONFIG_MEMCG_KMEM */
>>  
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>> @@ -7181,6 +7210,7 @@ static int __init mem_cgroup_init(void)
>>  	enable_swap_cgroup();
>>  	mem_cgroup_soft_limit_tree_init();
>>  	memcg_stock_init();
>> +	memcg_kmem_init();
>>  	return 0;
>>  }
>>  subsys_initcall(mem_cgroup_init);
>> -- 
>> 1.7.10.4
>>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RESEND -mm 02/12] memcg: fix race in memcg cache destruction path
  2014-03-17 16:42     ` Michal Hocko
@ 2014-03-18  8:19       ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-18  8:19 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, hannes, glommer, linux-kernel, linux-mm, devel

On 03/17/2014 08:42 PM, Michal Hocko wrote:
> On Thu 13-03-14 19:06:40, Vladimir Davydov wrote:
>> We schedule memcg cache shrink+destruction work (memcg_params::destroy)
>> from two places: when we turn memcg offline
>> (mem_cgroup_destroy_all_caches) and when the last page of the cache is
>> freed (memcg_params::nr_pages reachs zero, see memcg_release_pages,
>> mem_cgroup_destroy_cache).
> This is just ugly! Why do we mem_cgroup_destroy_all_caches from the
> offline code at all? Just calling kmem_cache_shrink and then wait for
> the last pages to go away should be sufficient to fix this, no?

The problem is that kmem_cache_shrink() can take the slab_mutex, while we
iterate over the memcg caches to be destroyed under the memcg's
slab_caches_mutex, which nests inside the slab_mutex (see
mem_cgroup_destroy_all_caches()). So we can't call kmem_cache_shrink()
there directly: lockdep would complain about the inverted lock order.
That's what all the trickery with using the same work item for both cache
shrinking and destruction is for. I agree this is ugly and somewhat
difficult to understand. Let me share my thoughts on this problem.
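
Schematically (simplified, not the literal call chains):

    /* established order (cache creation) */    /* a direct shrink from offline */
    kmem_cache_create_memcg():                  mem_cgroup_destroy_all_caches():
      mutex_lock(&slab_mutex);                    mutex_lock(&memcg->slab_caches_mutex);
        memcg_register_cache():                     kmem_cache_shrink(cachep);
          mutex_lock(&memcg->slab_caches_mutex);       mutex_lock(&slab_mutex); /* inverted */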

First, why do we need to call kmem_cache_shrink() there at all? AFAIU,
this is because most memcg caches should be empty by the time the memcg
is destroyed, but some pages are kept on the caches for performance
reasons, and those pages hold cache references
(kmem_cache::memcg_params::nr_pages) that prevent the caches from being
destroyed. So we try to shrink the caches to get rid of the empty ones. If
the shrink fails (i.e. the memcg cache is not empty), the cache stays
pending for good in the current implementation (at least until someone
shrinks it manually). Glauber intended to fix the issue with pending
caches by reaping them on vmpressure, but unfortunately he didn't have
enough time to complete this.

But why do we take cache references per slab? I mean, why do we increment
the cache refcounter (kmem_cache::memcg_params::nr_pages currently) when
allocating a slab rather than an individual object? If we took the
reference to the cache per individual object, we would not have to call
kmem_cache_shrink() on memcg offline at all: if the cache is empty, it
would be destroyed immediately, because its refcounter would reach 0;
otherwise we could leave it hanging around for a while and only try to
shrink it on vmpressure, when we really need free memory. That would make
the destroy_work straightforward - it would simply call
kmem_cache_destroy() and that's it.

I guess I can foresee the answer to the question I've just raised: using a
per-cache refcounter and taking it on each alloc/free would hurt
scalability too much. However, we could use a percpu refcounter to
overcome this, couldn't we?
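
A very rough sketch of what that could look like with struct percpu_ref
from <linux/percpu-refcount.h> (the refcnt field and the helpers below are
invented for illustration; this is not part of the patch set):

    struct memcg_cache_params {
            /* ... existing fields ... */
            struct percpu_ref refcnt;       /* would replace nr_pages */
    };

    static void memcg_cache_params_release(struct percpu_ref *ref)
    {
            struct memcg_cache_params *params =
                    container_of(ref, struct memcg_cache_params, refcnt);

            /* the last object is gone - safe to destroy the cache */
            schedule_work(&params->destroy);
    }

    /* taken on every object allocation from a memcg cache */
    static inline void memcg_cache_get(struct kmem_cache *s)
    {
            percpu_ref_get(&s->memcg_params->refcnt);
    }

    /* dropped on every object free */
    static inline void memcg_cache_put(struct kmem_cache *s)
    {
            percpu_ref_put(&s->memcg_params->refcnt);
    }

Memcg offline would then just percpu_ref_kill() the counter instead of
shrinking the caches.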

There is one more argument for taking the cache refcount on a per-object
(not per-slab) basis. There seems to be a race in the kmem allocation
path. The point is that there is a time window between the moment we get
the cache to allocate from (memcg_kmem_get_cache()) and the actual
allocation from that cache (see slab_alloc_node()). Nothing prevents the
cache from going away in this window - the task can change its cgroup and
the former cgroup can be taken offline, resulting in the cache's
destruction. This is very unlikely, but still possible. There is a similar
problem with freeing objects - currently we might continue using a cache
after we have actually freed the last object and dropped the reference:
look at kmem_freepages(), where we dereference the cache pointer after
calling memcg_release_pages(), which drops the cache reference. The latter
is more or less easy to fix by ensuring we always drop the reference only
after we have stopped using the cache, but AFAIU this would imply heavy
intrusion into slab internals, which is bad. OTOH, if we took the cache
reference per allocated object, both problems would be resolved
automatically and cleanly.
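
The window in the allocation path, sketched (simplified, not the literal
code):

    cachep = memcg_kmem_get_cache(cachep, gfpflags);
    /*
     * <-- the task migrates to another cgroup here, the old cgroup goes
     *     offline and its caches are destroyed; cachep may already have
     *     been freed by the time we use it below
     */
    obj = slab_alloc_node(cachep, gfpflags, node);      /* use-after-free */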

I haven't included that in this set, because I tried not to blow it up too
much - I just wanted to introduce cache reparenting, fixing along the way
only those issues that had become really painful. Not sure it was the
right decision though :-/

Anyway, I would appreciate if you could share your thoughts about that.

> Whether the current code is good (no it's not) is another question. But
> this should be fixed also in the stable trees (is the bug there since
> the very beginning?) so the fix should be as simple as possible IMO.
> So if there is a simpler solution I would prefer it. But I am drowning
> in the kmem trickiness spread out all over the place so I might be
> missing something very easily.

Frankly, I'm not worrying about stable trees for now, because I don't
think anybody is using kmemcg yet: without fs cache shrinking it looks
pretty useless. Maybe I'm wrong :-/

Thanks.

>> Since the latter can happen while the work
>> scheduled from mem_cgroup_destroy_all_caches is in progress or still
>> pending, we need to be cautious to avoid races there - we should
>> accurately bail out in one of those functions if we see that the other
>> is in progress. Currently we only check if memcg_params::nr_pages is 0
>> in the destruction work handler and do not destroy the cache if so. But
>> that's not enough. An example of race we can get is shown below:
>>
>>   CPU0					CPU1
>>   ----					----
>>   kmem_cache_destroy_work_func:		memcg_release_pages:
>> 					  atomic_sub_and_test(1<<order, &s->
>> 							memcg_params->nr_pages)
>> 					  /* reached 0 => schedule destroy */
>>
>>     atomic_read(&cachep->memcg_params->nr_pages)
>>     /* 0 => going to destroy the cache */
>>     kmem_cache_destroy(cachep);
>>
>> 					  mem_cgroup_destroy_cache(s):
>> 					    /* the cache was destroyed on CPU0
>> 					       - use after free */
>>
>> An obvious way to fix this would be substituting the nr_pages counter
>> with a reference counter and make memcg take a reference. The cache
>> destruction would be then scheduled from that thread which decremented
>> the refcount to 0. Generally, this is what this patch does, but there is
>> one subtle thing here - the work handler serves not only for cache
>> destruction, it also shrinks the cache if it's still in use (we can't
>> call shrink directly from mem_cgroup_destroy_all_caches due to locking
>> dependencies). We handle this by noting that we should only issue shrink
>> if called from mem_cgroup_destroy_all_caches, because the cache is
>> already empty when we release its last page. And if we drop the
>> reference taken by memcg in the work handler, we can detect who exactly
>> scheduled the worker - mem_cgroup_destroy_all_caches or
>> memcg_release_pages.
>>
>> Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Cc: Michal Hocko <mhocko@suse.cz>
>> Cc: Glauber Costa <glommer@gmail.com>
> [...]


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RESEND -mm 01/12] memcg: flush cache creation works before memcg cache destruction
  2014-03-18  8:14       ` Vladimir Davydov
@ 2014-03-18  8:55         ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2014-03-18  8:55 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, glommer, linux-kernel, linux-mm, devel

On Tue 18-03-14 12:14:37, Vladimir Davydov wrote:
> On 03/17/2014 08:07 PM, Michal Hocko wrote:
> > On Thu 13-03-14 19:06:39, Vladimir Davydov wrote:
> >> When we get to memcg cache destruction, either from the root cache
> >> destruction path or when turning memcg offline, there still might be
> >> memcg cache creation works pending that was scheduled before we
> >> initiated destruction. We need to flush them before starting to destroy
> >> memcg caches, otherwise we can get a leaked kmem cache or, even worse,
> >> an attempt to use after free.
> > How can we use-after-free? Even if there is a pending work item to
> > create a new cache then we keep the css reference for the memcg and
> > release it from the worker (memcg_create_cache_work_func). So although
> > this can race with memcg offlining the memcg itself will be still alive.
> 
> There are actually two issues:
> 
> 1) When we destroy a root cache using kmem_cache_destroy(), we should
> ensure all pending memcg creation works for this root cache are over,
> otherwise a work could be executed after the root cache is destroyed
> resulting in use-after-free.

Dunno, but this sounds backwards to me. If we are using a root cache to
create a new child, then the child should make sure that the root doesn't
go away, no? Can't we take a reference to the root cache before we
schedule memcg_create_cache_work_func?
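
Something like this, perhaps (root_cache_get/put are invented names, just
to show where the reference would be taken and dropped):

    /* in __memcg_create_cache_enqueue(), before queueing the work */
    root_cache_get(cachep);                         /* pin the root cache */
    queue_work(memcg_cache_create_wq, &cw->work);

    /* in memcg_create_cache_work_func(), once the child cache exists */
    root_cache_put(cw->cachep);                     /* may free a dying root */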

But I admit that the root cache concept is not entirely clear to me.

> 2) Memcg offline. In this case use-after-free is impossible in a memcg
> creation work handler, because, as you mentioned, the work holds the css
> reference. However, we still have to synchronize against pending
> requests, otherwise a work handler can be executed after we destroyed
> the caches corresponding to the memcg being offlined resulting in a
> kmem_cache leak.

If that is the case, then we should come up with proper synchronization,
because synchronizing via workqueues with explicit flushing and canceling
is really bad.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RESEND -mm 01/12] memcg: flush cache creation works before memcg cache destruction
  2014-03-18  8:55         ` Michal Hocko
@ 2014-03-18  9:28           ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-18  9:28 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, hannes, glommer, linux-kernel, linux-mm, devel

On 03/18/2014 12:55 PM, Michal Hocko wrote:
> On Tue 18-03-14 12:14:37, Vladimir Davydov wrote:
>> On 03/17/2014 08:07 PM, Michal Hocko wrote:
>>> On Thu 13-03-14 19:06:39, Vladimir Davydov wrote:
>>>> When we get to memcg cache destruction, either from the root cache
>>>> destruction path or when turning memcg offline, there still might be
>>>> memcg cache creation works pending that was scheduled before we
>>>> initiated destruction. We need to flush them before starting to destroy
>>>> memcg caches, otherwise we can get a leaked kmem cache or, even worse,
>>>> an attempt to use after free.
>>> How can we use-after-free? Even if there is a pending work item to
>>> create a new cache then we keep the css reference for the memcg and
>>> release it from the worker (memcg_create_cache_work_func). So although
>>> this can race with memcg offlining the memcg itself will be still alive.
>> There are actually two issues:
>>
>> 1) When we destroy a root cache using kmem_cache_destroy(), we should
>> ensure all pending memcg creation works for this root cache are over,
>> otherwise a work could be executed after the root cache is destroyed
>> resulting in use-after-free.
> Dunno, but this sounds backwards to me. If we are using a root cache for
> a new child creation then the child should make sure that the root
> doesn't go away, no? Cannot we take a reference to the root cache before
> we schedule memcg_create_cache_work_func?

Yeah, that would work of course. We already have kmem_cache::refcount,
which is currently used for alias handling, and I guess we could reuse
it here. We would only have to make it atomic, because we can't take the
slab_mutex in memcg_kmem_get_cache(), but it shouldn't be a problem.
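
To make that concrete, here is a minimal sketch of what pinning the
root cache around a pending creation work could look like, assuming
kmem_cache::refcount becomes an atomic_t; the helper names below are
made up for illustration:

#include <linux/atomic.h>
#include <linux/slab.h>

/*
 * Sketch only, not the actual patch: assumes kmem_cache::refcount has
 * been converted to an atomic_t and that a zero refcount means the
 * cache is going away.
 */

/* hypothetical helper that performs the real destruction */
void memcg_schedule_kmem_cache_destroy(struct kmem_cache *s);

static inline bool kmem_cache_tryget(struct kmem_cache *s)
{
	return atomic_inc_not_zero(&s->refcount);
}

static inline void kmem_cache_put(struct kmem_cache *s)
{
	if (atomic_dec_and_test(&s->refcount))
		memcg_schedule_kmem_cache_destroy(s);
}

/*
 * memcg_kmem_get_cache() would call kmem_cache_tryget() on the root
 * cache before scheduling memcg_create_cache_work_func(); the worker
 * drops the reference with kmem_cache_put() once the child cache is
 * registered (or creation is abandoned), so kmem_cache_destroy() of
 * the root cache cannot complete under a pending work item.
 */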

> But I admit that the root cache concept is not entirely clear to me.
>
>> 2) Memcg offline. In this case use-after-free is impossible in a memcg
>> creation work handler, because, as you mentioned, the work holds the css
>> reference. However, we still have to synchronize against pending
>> requests, otherwise a work handler can be executed after we destroyed
>> the caches corresponding to the memcg being offlined resulting in a
>> kmem_cache leak.
> If that is the case then we should come up with proper synchronization,
> because synchronizing via workqueues with explicit flushing and
> canceling is really bad.

Would something like this be suitable as proper synchronization:

mem_cgroup_destroy_all_caches():
    /* currently we don't take the slab_mutex here,
     * so we'd have to add this line */
    take slab_mutex
    mark the memcg dead
    schedule the memcg's caches destruction
    release slab_mutex

kmem_cache_create_memcg():
    take slab_mutex
    if memcg is not dead, then create a cache
    release slab_mutex

?
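
Spelled out in C, that proposal might look roughly like the sketch
below; the kmem_dead flag and the creation helper are assumptions made
for the sketch, not existing code:

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

/* hypothetical helper that does the actual creation under slab_mutex */
struct kmem_cache *do_kmem_cache_create(struct mem_cgroup *memcg,
					struct kmem_cache *root_cache);

/*
 * Sketch: 'kmem_dead' is an assumed flag on struct mem_cgroup;
 * memcg_slab_caches and memcg_params::destroy are as discussed above.
 */
static void mem_cgroup_destroy_all_caches(struct mem_cgroup *memcg)
{
	struct memcg_cache_params *params;

	mutex_lock(&slab_mutex);
	memcg->kmem_dead = true;	/* reject creation requests from now on */
	list_for_each_entry(params, &memcg->memcg_slab_caches, list)
		schedule_work(&params->destroy);
	mutex_unlock(&slab_mutex);
}

static struct kmem_cache *memcg_create_cache(struct mem_cgroup *memcg,
					     struct kmem_cache *root_cache)
{
	struct kmem_cache *s = NULL;

	mutex_lock(&slab_mutex);
	if (!memcg->kmem_dead)		/* serialized with the offline path */
		s = do_kmem_cache_create(memcg, root_cache);
	mutex_unlock(&slab_mutex);
	return s;
}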

Thanks.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RESEND -mm 02/12] memcg: fix race in memcg cache destruction path
  2014-03-18  8:19       ` Vladimir Davydov
@ 2014-03-18 10:01         ` Michal Hocko
  -1 siblings, 0 replies; 42+ messages in thread
From: Michal Hocko @ 2014-03-18 10:01 UTC (permalink / raw)
  To: Vladimir Davydov; +Cc: akpm, hannes, glommer, linux-kernel, linux-mm, devel

On Tue 18-03-14 12:19:00, Vladimir Davydov wrote:
> On 03/17/2014 08:42 PM, Michal Hocko wrote:
> > On Thu 13-03-14 19:06:40, Vladimir Davydov wrote:
> >> We schedule memcg cache shrink+destruction work (memcg_params::destroy)
> >> from two places: when we turn memcg offline
> >> (mem_cgroup_destroy_all_caches) and when the last page of the cache is
> >> freed (memcg_params::nr_pages reachs zero, see memcg_release_pages,
> >> mem_cgroup_destroy_cache).
> > This is just ugly! Why do we mem_cgroup_destroy_all_caches from the
> > offline code at all? Just calling kmem_cache_shrink and then wait for
> > the last pages to go away should be sufficient to fix this, no?
> 
> The problem is kmem_cache_shrink() can take the slab_mutex, and we
> iterate over memcg caches to be destroyed under the memcg's
> slab_caches_mutex, which is nested into the slab_mutex (see
> mem_cgroup_destroy_all_caches()). So we can't call kmem_cache_shrink()
> there directly due to lockdep. That's what all that trickery with using
> the same work for both cache shrinking and destruction is for. I agree
> this is ugly and somewhat difficult to understand. Let me share my
> thoughts on this problem.

Nothing prevents mem_cgroup_destroy_all_caches from calling kmem_cache_shrink
from the workqueue context, no? And then we can move all caches which
still have some pages to the parent memcg + update back-pointers from
respective page_cgroups.
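
Something along these lines, perhaps; the cachep back-pointer is the
one patch 5 adds, while the rest is an illustrative sketch rather than
the actual code:

#include <linux/slab.h>
#include <linux/workqueue.h>

/*
 * Sketch: do the shrink from the workqueue, so nothing heavy is done
 * under the memcg's slab_caches_mutex in the offline path. Assumes
 * memcg_cache_params carries a back-pointer to its kmem_cache and an
 * atomic nr_pages counter.
 */
static void memcg_cache_destroy_work_func(struct work_struct *work)
{
	struct memcg_cache_params *params =
		container_of(work, struct memcg_cache_params, destroy);
	struct kmem_cache *cachep = params->cachep;

	kmem_cache_shrink(cachep);
	if (atomic_read(&params->nr_pages) == 0)
		kmem_cache_destroy(cachep);
	/*
	 * Otherwise the cache still holds objects; the final
	 * memcg_release_pages() (or a later vmpressure pass) would have
	 * to finish the destruction.
	 */
}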

> First, why do we need to call kmem_cache_shrink() there at all? AFAIU,
> this is because most of memcg caches must be empty by the time the memcg
> is destroyed, but since there are some pages left on the caches for
> performance reasons, and those pages hold cache references
> (kmem_cache::memcg_params::nr_pages) preventing it from being destroyed,
> we try to shrink them to get rid of empty caches. If shrink fails (i.e.
> the memcg cache is not empty) the cache will be pending for good in the
> current implementation (until someone calls the shrink manually at
> least). Glauber intended to fix the issue with pending caches by reaping
> them on vmpressure, but he didn't have enough time to complete this,
> unfortunately.
> 
> But why do we get cache references per slab?

I guess this is natural from the charging point of view. It is also less
intrusive because this is a slow path.

> I mean why do we inc the
> cache refcounter (kmem_cache::memcg_params::nr_pages currently) when
> allocating a slab, not an individual object? If we took the reference to
> the cache per individual object, we would not have to call
> kmem_cache_shrink() on memcg offline - if the cache is empty it will be
> destroyed immediately then, because its refcounter reaches 0, otherwise
> we could leave it hanging around for a while and only try to shrink it
> on vmpressure when we really need free mem. That would make the
> destroy_work straightforward - it would simply call kmem_cache_destroy()
> and that's it.
> 
> I guess I foresee the answer to the question I've just raised - using a
> per cache refcounter and taking it on each alloc/free would hurt
> scalability too much. However, we could use percpu refcounter to
> overcome this, couldn't we?

I am afraid this would still be too invasive for the fast path. That
would be a question for slab guys though.
 
> There is one more argument for taking the cache refcount on a per-object
> (not per-slab) basis. There seems to be a race in kmem allocation path.
> The point is there is a time window between we get the cache to allocate
> from (memcg_kmem_get_cache()) and the actual allocating from the cache
> (see slab_alloc_node()). Actually, nothing prevents the cache from going
> away in this time window

By "the cache" you mean the memcg variant or the global one?

> - the task can change its cgroup and the former cgroup can be taken
> offline resulting in the cache destruction.

With a proper synchronization this shouldn't be a big deal I suppose.
kmem_cache_create_memcg should check that the memcg is still alive. We
mark caches dead and we would need something like that per-memcg as well
during css_offline (after all memcg_slab_caches were shrunk). The create
worker would then back off and fail when trying to register the cache.

> This is very unlikely, but still possible.

> A similar problem with freeing
> objects - currently we might continue using a cache after we actually
> freed the last object and dropped the reference - look at
> kmem_freepages(), there we dereference the cache pointer after calling
> memcg_release_pages(), which drops the cache reference.

This sounds like an ordering issue. memcg_release_pages should be called
last. I do not see why the ordering was done this way.

> The latter is
> more-or-less easy to fix though by ensuring we always drop the reference
> after we stopped using the cache, but this would imply heavy intrusion
> into slab internals AFAIU, which is bad. OTOH if we took the cache
> reference per allocated object, these problems would be resolved
> automatically and clearly.
> 
> I haven't included that in this set, because I tried not to blow it too
> much, I just wanted to introduce cache reparenting, in the meanwhile
> fixing only those issues that had become really painful. Not sure if it
> was the right decision though :-/

I would really prefer to go with simplifications first and build
reparenting on top of that.
 
> Anyway, I would appreciate if you could share your thoughts about that.
> 
> > Whether the current code is good (no it's not) is another question. But
> > this should be fixed also in the stable trees (is the bug there since
> > the very beginning?) so the fix should be as simple as possible IMO.
> > So if there is a simpler solution I would prefer it. But I am drowning
> > in the kmem trickiness spread out all over the place so I might be
> > missing something very easily.
> 
> Frankly, I'm not bothering about stable trees by now, because I don't
> think anybody is using kmemcg since w/o fs cache shrinking it looks
> pretty useless. May be, I'm wrong :-/

OK, it is all opt-in so there shouldn't be any harm for those who do not
use the feature which makes it less urgent but I can still imagine that
somebody might want to use the feature even on older kernels.

You are right that the feature is really dubious without proper
shrinking, which was, btw, my objection at the time when we discussed
that at LSF (before it got merged).
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH RESEND -mm 02/12] memcg: fix race in memcg cache destruction path
  2014-03-18 10:01         ` Michal Hocko
@ 2014-03-18 12:14           ` Vladimir Davydov
  -1 siblings, 0 replies; 42+ messages in thread
From: Vladimir Davydov @ 2014-03-18 12:14 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, hannes, glommer, linux-kernel, linux-mm, devel

On 03/18/2014 02:01 PM, Michal Hocko wrote:
> On Tue 18-03-14 12:19:00, Vladimir Davydov wrote:
>> On 03/17/2014 08:42 PM, Michal Hocko wrote:
>>> On Thu 13-03-14 19:06:40, Vladimir Davydov wrote:
>>>> We schedule memcg cache shrink+destruction work (memcg_params::destroy)
>>>> from two places: when we turn memcg offline
>>>> (mem_cgroup_destroy_all_caches) and when the last page of the cache is
>>>> freed (memcg_params::nr_pages reachs zero, see memcg_release_pages,
>>>> mem_cgroup_destroy_cache).
>>> This is just ugly! Why do we mem_cgroup_destroy_all_caches from the
>>> offline code at all? Just calling kmem_cache_shrink and then wait for
>>> the last pages to go away should be sufficient to fix this, no?
>> The problem is kmem_cache_shrink() can take the slab_mutex, and we
>> iterate over memcg caches to be destroyed under the memcg's
>> slab_caches_mutex, which is nested into the slab_mutex (see
>> mem_cgroup_destroy_all_caches()). So we can't call kmem_cache_shrink()
>> there directly due to lockdep. That's what all that trickery with using
>> the same work for both cache shrinking and destruction is for. I agree
>> this is ugly and somewhat difficult to understand. Let me share my
>> thoughts on this problem.
> Nothing prevents mem_cgroup_destroy_all_caches from calling kmem_cache_shrink
> from the workqueue context, no? And then we can move all caches which
> still have some pages to the parent memcg

Currently, when reparenting memcg caches (see patch 11), I move memcg
caches from the memcg_slab_caches list of the memcg being offlined to
its parent with both memcgs' slab_caches_mutexes held - this makes
synchronization with kmem_cache_destroy easier. However, if we could
take a reference to the root cache there, protecting it from being
destroyed, I guess we could safely release memcg::slab_caches_mutex
between removing the caches from the old list (of the memcg being
destroyed) and inserting them into the new (parent's) list. That said,
you must be right: introducing full-fledged refcounting for kmem_caches
would simplify the code a lot.
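
For illustration, the list move with both mutexes held might look
roughly like the sketch below; the field names follow the discussion,
so treat it as a sketch of the idea rather than the patch itself:

#include <linux/list.h>
#include <linux/mutex.h>

static void memcg_reparent_slab_caches(struct mem_cgroup *memcg,
				       struct mem_cgroup *parent)
{
	struct memcg_cache_params *params, *tmp;

	mutex_lock(&parent->slab_caches_mutex);
	mutex_lock_nested(&memcg->slab_caches_mutex, SINGLE_DEPTH_NESTING);

	list_for_each_entry_safe(params, tmp,
				 &memcg->memcg_slab_caches, list) {
		list_move(&params->list, &parent->memcg_slab_caches);
		params->memcg = parent;		/* re-point the owner */
	}

	mutex_unlock(&memcg->slab_caches_mutex);
	mutex_unlock(&parent->slab_caches_mutex);
}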

> + update back-pointers from respective page_cgroups.

As Johannes mentioned, we don't need page_cgroups for kmem pages since
we can track kmem objects through page->slab_cache, and this set gets
rid of them (patch 7).

>> First, why do we need to call kmem_cache_shrink() there at all? AFAIU,
>> this is because most of memcg caches must be empty by the time the memcg
>> is destroyed, but since there are some pages left on the caches for
>> performance reasons, and those pages hold cache references
>> (kmem_cache::memcg_params::nr_pages) preventing it from being destroyed,
>> we try to shrink them to get rid of empty caches. If shrink fails (i.e.
>> the memcg cache is not empty) the cache will be pending for good in the
>> current implementation (until someone calls the shrink manually at
>> least). Glauber intended to fix the issue with pending caches by reaping
>> them on vmpressure, but he didn't have enough time to complete this,
>> unfortunately.
>>
>> But why do we get cache references per slab?
> I guess this is natural from the charging point of view. It is also less
> intrusive because this is a slow path.
>
>> I mean why do we inc the
>> cache refcounter (kmem_cache::memcg_params::nr_pages currently) when
>> allocating a slab, not an individual object? If we took the reference to
>> the cache per individual object, we would not have to call
>> kmem_cache_shrink() on memcg offline - if the cache is empty it will be
>> destroyed immediately then, because its refcounter reaches 0, otherwise
>> we could leave it hanging around for a while and only try to shrink it
>> on vmpressure when we really need free mem. That would make the
>> destroy_work straightforward - it would simply call kmem_cache_destroy()
>> and that's it.
>>
>> I guess I foresee the answer to the question I've just raised - using a
>> per cache refcounter and taking it on each alloc/free would hurt
>> scalability too much. However, we could use percpu refcounter to
>> overcome this, couldn't we?
> I am afraid this would still be too invasive for the fast path. That
> would be a question for slab guys though.

Yeah, that's bothering me too. I guess I'll try to send it as an RFC in
a separate patch to get more feedback, because this would be a great
simplification, IMO.
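
As a rough illustration of the per-object idea with a percpu refcount
(the layout and hook points are assumptions for the sketch, not a
concrete proposal):

#include <linux/percpu-refcount.h>
#include <linux/workqueue.h>

/* simplified, assumed layout - the real memcg_cache_params differs */
struct memcg_cache_params {
	struct percpu_ref refcnt;	/* one reference per live object */
	struct work_struct destroy;	/* ends up in kmem_cache_destroy() */
};

/* release callback: runs once the last object reference is dropped */
static void memcg_cache_release(struct percpu_ref *ref)
{
	struct memcg_cache_params *params =
		container_of(ref, struct memcg_cache_params, refcnt);

	schedule_work(&params->destroy);
}

/* at cache creation: percpu_ref_init(&params->refcnt, memcg_cache_release) */

/* alloc fast path: pin the cache per allocated object */
static inline void memcg_cache_get(struct memcg_cache_params *params)
{
	percpu_ref_get(&params->refcnt);
}

/* free fast path: drop the per-object pin */
static inline void memcg_cache_put(struct memcg_cache_params *params)
{
	percpu_ref_put(&params->refcnt);
}

/* memcg offline: percpu_ref_kill(&params->refcnt) drops the base reference */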

>> There is one more argument for taking the cache refcount on a per-object
>> (not per-slab) basis. There seems to be a race in kmem allocation path.
>> The point is there is a time window between we get the cache to allocate
>> from (memcg_kmem_get_cache()) and the actual allocating from the cache
>> (see slab_alloc_node()). Actually, nothing prevents the cache from going
>> away in this time window
> By "the cache" you mean the memcg variant or the global one?

memcg one

>> - the task can change its cgroup and the former cgroup can be taken
>> offline resulting in the cache destruction.
> With a proper synchronization this shouldn't be a big deal I suppose.
> kmem_cache_create_memcg should check that the memcg is still alive. We
> mark caches dead and we would need something like that per-memcg as well
> during css_offline (after all memcg_slab_caches were shrunk). The create
> worker would then back off and fail when trying to register the cache.

Let's consider the following scenario:

1) A process inside a memcg calls kmem_cache_alloc(). Suppose the
memcg's cache already exists and is empty.
2) kmem_cache_alloc() calls memcg_kmem_get_cache(), which returns the
pointer to the memcg cache to allocate from.
3) At this time, the process is moved out of the memcg, and the memcg is
destroyed along with its empty caches.
4) The process then proceeds to allocate an object from the cache it
was given by memcg_kmem_get_cache() at step 2, but that cache has
already been freed.

Does it make sense?

>> This is very unlikely, but still possible.
>> A similar problem with freeing
>> objects - currently we might continue using a cache after we actually
>> freed the last object and dropped the reference - look at
>> kmem_freepages(), there we dereference the cache pointer after calling
>> memcg_release_pages(), which drops the cache reference.
> This sounds like an ordering issue. memcg_release_pages should be called
> last. I do not see why the ordering was done this way.

Fixing the ordering only in this function wouldn't solve the problem
completely. Look at slab's cache_free_alien() - it unlocks
cachep->node[nodeid].list_lock after calling free_block(), which might
schedule the cache destruction. The problem is that kmem_freepages()
and __free_slab(), which issue memcg_release_pages(), are not
necessarily the last functions using the cache - the code calling them
may still want to touch the cache.

>> The latter is
>> more-or-less easy to fix though by ensuring we always drop the reference
>> after we stopped using the cache, but this would imply heavy intrusion
>> into slab internals AFAIU, which is bad. OTOH if we took the cache
>> reference per allocated object, these problems would be resolved
>> automatically and clearly.
>>
>> I haven't included that in this set, because I tried not to blow it too
>> much, I just wanted to introduce cache reparenting, in the meanwhile
>> fixing only those issues that had become really painful. Not sure if it
>> was the right decision though :-/
> I would really prefer to go with simplifications first and build
> reparenting on top of that.

Agreed, but first we need to come to an agreement about which
simplifications we should go with.

IMO, the most confusing part of kmemcg is how memcg cache
creation/destruction is handled - all those synchronization tricks
between kmem_cache_destroy(), kmem_cache_create() of the root cache,
the memcg cache, etc. I think we would benefit if we introduced
fully-fledged refcounting for kmem caches. This would also solve the
problems the first three patches of this set address.

Next, as I've already mentioned, we can handle kmem w/o using
page_cgroups provided we don't charge kmalloc_large allocations. See
patches 7-10 - they remove ~270 lines of code while adding only ~70.
Looking at those patches now, I'm starting to think it would be better
to send them first in a separate set.

Currently, nothing else springs to my mind. I guess I need to think a
bit more...

Thanks.

>> Anyway, I would appreciate if you could share your thoughts about that.
>>
>>> Whether the current code is good (no it's not) is another question. But
>>> this should be fixed also in the stable trees (is the bug there since
>>> the very beginning?) so the fix should be as simple as possible IMO.
>>> So if there is a simpler solution I would prefer it. But I am drowning
>>> in the kmem trickiness spread out all over the place so I might be
>>> missing something very easily.
>> Frankly, I'm not bothering about stable trees by now, because I don't
>> think anybody is using kmemcg since w/o fs cache shrinking it looks
>> pretty useless. May be, I'm wrong :-/
> OK, it is all opt-in so there shouldn't be any harm for those who do not
> use the feature which makes it less urgent but I can still imagine that
> somebody might want to use the feature even on older kernels.
>
> You are right that the feature is really dubious without proper
> shrinking, which was, btw, my objection at the time when we discussed
> that at LSF (before it got merged).


^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2014-03-18 12:22 UTC | newest]

Thread overview: 42+ messages
2014-03-13 15:06 [PATCH RESEND -mm 00/12] kmemcg reparenting Vladimir Davydov
2014-03-13 15:06 ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 01/12] memcg: flush cache creation works before memcg cache destruction Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-17 16:07   ` Michal Hocko
2014-03-17 16:07     ` Michal Hocko
2014-03-18  8:14     ` Vladimir Davydov
2014-03-18  8:14       ` Vladimir Davydov
2014-03-18  8:55       ` Michal Hocko
2014-03-18  8:55         ` Michal Hocko
2014-03-18  9:28         ` Vladimir Davydov
2014-03-18  9:28           ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 02/12] memcg: fix race in memcg cache destruction path Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-17 16:42   ` Michal Hocko
2014-03-17 16:42     ` Michal Hocko
2014-03-18  8:19     ` Vladimir Davydov
2014-03-18  8:19       ` Vladimir Davydov
2014-03-18 10:01       ` Michal Hocko
2014-03-18 10:01         ` Michal Hocko
2014-03-18 12:14         ` Vladimir Davydov
2014-03-18 12:14           ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 03/12] memcg: fix root vs memcg cache destruction race Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 04/12] memcg: move slab caches list/mutex init to memcg creation Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 05/12] memcg: add pointer from memcg_cache_params to cache Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 06/12] memcg: keep all children of each root cache on a list Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 07/12] memcg: rework slab charging Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 08/12] memcg: do not charge kmalloc_large allocations Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 09/12] fork: do not charge thread_info to kmemcg Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 10/12] memcg: kill GFP_KMEMCG and stuff Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 11/12] memcg: reparent slab on css offline Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
2014-03-13 15:06 ` [PATCH RESEND -mm 12/12] slub: make sure all memcg caches have unique names on sysfs Vladimir Davydov
2014-03-13 15:06   ` Vladimir Davydov
